Dependency structure annotation in the IULA Spanish LSP Treebank

  • Original Paper
  • Published in Language Resources and Evaluation

Abstract

This paper presents the IULA Spanish LSP Treebank, an open-source treebank of over 40,000 sentences, developed in the framework of the European project METANET4U. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at surface syntactic level, following the dependency grammar theory. We present the method we used to create the resource and the linguistic annotations that the treebank provides, using examples and comparing with similar resources. We also provide the statistics of the treebank and the evaluation results.

Notes

  1. Institut Universitari de Lingüística Aplicada.

  2. Language for Special Purposes.

  3. Enhancing the European Linguistic Infrastructure (GA 270893GA). http://www.metanet.eu/projects/METANET4U/.

  4. http://iula05v.upf.edu/TreebankBrowser.

  5. http://metashare.upf.edu and http://hdl.handle.net/10230/20408. Note that, even though the dependency structure annotations are derived from an HPSG treebank, as we explain in Sect. 3, only the dependency annotations are released.

  6. http://creativecommons.org/licenses/by/3.0/.

  7. This corpus is accessible with a browser that provides concordance-based search functions at http://bwananet.iula.upf.edu/bwananet1a.en.htm.

  8. Because sentences were selected at random, their surrounding context was lost; to recover it, the annotators accessed the complete texts.

  9. http://nlp.lsi.upc.edu/freeling/.

  10. See http://www.ilc.cnr.it/EAGLES96/annotate/annotate.html.

  11. The PoS tags associated with each word were integrated using the LKB Simple PreProcessor Protocol (SPPP; http://wiki.delph-in.net/moin/LkbSppp). Figure 3 shows how PoS tags are associated with each lexical item in the derivation tree.

  12. http://www.delph-in.net.

  13. Human annotation is not only more expensive and slower, but it also introduces more errors and inconsistencies, because of the difficulty and tiring nature of the task.

  14. Note that the grammar still lacks coverage for some constructions that appeared in the corpus.

  15. FreeLing errors could not be corrected and caused about 5 % of sentences to be rejected.

  16. We also developed a parser ensemble approach that automatically selects the linguistic analyses produced by SRG, using full agreement between the MaxEnt parse selection model and the MaltParser dependency parser (Nivre et al. 2007); it will be used in the near future to enlarge the treebank. In experiments on a set of 1,428 sentences, 445 sentences were selected (31.2 %). Precision (the number of correctly selected sentences among all selected sentences) stood at 90.6 % (403/445), while recall (the number of correctly selected sentences among all sentences whose correct analysis was ranked first) was 46.6 % (403/864). See Marimon et al. (2014) for further details.

  17. Note that all instances of a given NE type (e.g. proper names, dates, ...) are assigned the same lexical entry identifier, which means that there is only one lexical entry for each type. See Marimon (2013) for further details.

  18. Completive clauses may also function as SUBJ, as in example (1).

      (1) Sería preferible que estuviesen más acordes.
          be.CONDITIONAL.3RD.SG preferable that be.SUBJUNCTIVE.PRESENT.3RD.PL more consistent
          ‘It would be preferable that they were more consistent.’

  19. The dependency labels CONJ, COORD, and ENUM, used in coordinated constructions, and SUBJ-GAP, COMP-GAP, and MOD-GAP, for subjects, complements, and modifiers in gapping constructions, will be discussed in the following subsections.

  20. Spanish clitic pronouns can appear either attached to the right side of the host verb, the so-called enclitics, or as independent lexical units in front of the verb, known as proclitics. Infinitives, gerunds, and non-negated imperatives take enclitics, verbs in personal forms always require proclitics, and past participles cannot take clitics. As in the Spanish treebank described by McDonald et al. (2013), enclitics are not split and only proclitics are annotated.

  21. A similar analysis is provided in AnCora (Taulé et al. 2008), where a takes the label DO and the head of the NP takes the label “NP”, since the complements of a non-verbal head do not get a functional label in this treebank.

  22. In passive constructions the verb has a unique argument which is the syntactic subject.

  23. Zabokrtsky et al. (2013) describe some known problems related to coordination structure and present a taxonomy of various formal means developed for encoding it in dependency treebanks based on observations from a set of dependency treebanks for 26 languages.

  24. In addition, evaluation results using the treebank with different parsing systems are discussed in Padró et al. (2013).
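The percentages reported in note 16 follow directly from the raw counts given there. As a quick sanity check, the short Python sketch below recomputes them. The variable names and the `pct` helper are illustrative assumptions and do not come from the paper or its tooling.

```python
# Counts taken from note 16 (parser ensemble experiment).
total_sentences = 1428   # sentences in the experiment
selected = 445           # sentences selected by full parser agreement
correctly_selected = 403 # selected sentences whose analysis was correct
ranked_first = 864       # sentences whose correct analysis was ranked first


def pct(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, rounded to one decimal."""
    return round(100 * numerator / denominator, 1)


selection_rate = pct(selected, total_sentences)  # share of sentences selected
precision = pct(correctly_selected, selected)    # correct among selected
recall = pct(correctly_selected, ranked_first)   # correct among selectable

print(f"selected:  {selection_rate} %")
print(f"precision: {precision} %")
print(f"recall:    {recall} %")
```

Under these counts the script reproduces the 31.2 %, 90.6 %, and 46.6 % figures quoted in the note.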

References

  • Abeillé, A. (Ed.). (2003). Treebanks: Building and using parsed corpora. Dordrecht: Kluwer Academic Publishers.

  • Aduriz, I., Aranzabe, M. J., Arriola, J. M., Atutxa, A., de Ilarraza, A. D., Garmendia, A., et al. (2003). Construction of a Basque dependency treebank. In Proceedings of TLT-2003, Växjö, Sweden, pp. 201–204.

  • Afonso, S., Bick, E., Haber, R., & Santos, D. (2002). Floresta sintá(c)tica: A treebank for Portuguese. In Proceedings of LREC-2002, Las Palmas de Gran Canaria, Spain, pp. 1698–1703.

  • Bangalore, S. (2003). Localizing dependencies and supertagging. In R. Bod, R. Scha, & K. Sima’an (Eds.), Data-oriented parsing (pp. 283–298). Chicago: CSLI Publications, University of Chicago Press.

  • Böhmová, A., Hajič, J., Hajičová, E., & Hladká, B. (2003). The PDT: A 3-level annotation scenario. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora (pp. 103–127). Dordrecht: Kluwer Academic Publishers.

  • Bosco, C., Lombardo, V., Vassallo, D., & Lesmo, L. (2000). Building a treebank for Italian: a data-driven annotation schema. In Proceedings of LREC-2000, Athens, Greece.

  • Branco, A., Costa, F., Silva, J., Silveira, S., Castro, S., Avelãs, M., et al. (2010). Developing a deep linguistic databank supporting a collection of treebanks: The CINTIL DeepGramBank. In Proceedings of LREC-2010, Valletta, Malta.

  • Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., et al. (2004). TIGER: Linguistic interpretation of a German corpus. In E. Hinrichs & K. Simov (Eds.), Research on language and computation (Vol. 2, pp. 597–619). Berlin: Springer.

  • Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, New York City, USA.

  • Cabré, M. T., Bach, C., & Vivaldi, J. (2006). 10 anys del Corpus de l’IULA. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra.

  • Carter, D. (1997). The TreeBanker: A tool for supervised training of parsed corpora. In Proceedings of AAAI-97, Providence, Rhode Island, pp. 598–603.

  • Chaitanya, G., Husain, S., & Mannem, P. (2011). Empty categories in Hindi dependency treebank: Analysis and recovery. In Proceedings of LAW V, Portland, USA.

  • Collins, M., Hajič, J., Ramshaw, L., & Tillmann, C. (1999). A statistical parser for Czech. In Proceedings of ACL-1999, pp. 505–512.

  • Copestake, A. (2002). Implementing typed feature structure grammars. Stanford: CSLI Publications.

  • Copestake, A., Flickinger, D., Pollard, C., & Sag, I. A. (2006). Minimal recursion semantics: An introduction. Research on Language and Computation, 3(4), 281–332.

  • Covington, M. (2001). A fundamental algorithm for dependency parsing. In Proceedings of the 39th annual ACM southeast conference, pp. 95–102.

  • Dzeroski, S., Erjavec, T., Ledinek, N., Pajas, P., Zabokrtsky, Z., & Zele, A. (2006). Towards a Slovene dependency treebank. In Proceedings of LREC-2006, Genoa, Italy.

  • Eisner, J. M. (1996a). An empirical comparison of probability models for dependency grammar IRCS-96-11. Technical report, Institute for Research in Cognitive Science, University of Pennsylvania.

  • Eisner, J. M. (1996b). Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING 1996, Copenhagen, Denmark, pp. 340–345.

  • Eisner, J. M. (2000). Bilexical grammars and their cubic-time parsing algorithms. In H. Bunt & A. Nijholt (Eds.), Advances in probabilistic and other parsing technologies (pp. 29–62). Dordrecht: Kluwer Academic Publishers.

  • Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., et al. (2012). ParDeepBank: Multiple parallel deep treebanking. In Proceedings of TLT-2012, Lisbon, Portugal, pp. 97–108.

  • Garside, R., Leech, G., & Váradi, T. (1992). Lancaster parsed corpus. A machine-readable syntactically analyzed corpus of 144,000 words, available for distribution through ICAME. Bergen: The Norwegian Computing Centre for the Humanities.

  • Harper, M. P., & Helzerman, R. A. (1995). Extensions to constraint dependency parsing for spoken language processing. Computer Speech and Language, 9, 187–234.

  • Hashimoto, C., Bond, F., & Siegel, M. (2007). Semi-automatic documentation of an implemented linguistic grammar augmented with a treebank. Language Resources and Evaluation (Special issue on Asian language technology), 42(2), 117–126.

  • Hellwig, P. (1986). Dependency unification grammar. In Proceedings of COLING 1986, Bonn, Germany, pp. 195–198.

  • Hellwig, P. (2003). Dependency unification grammar. In V. Agel, L. M. Eichinger, H. W. Eroms, P. Hellwig, H. J. Heringer, & H. Lobin (Eds.), Dependency and valency (pp. 593–635). Berlin: Walter de Gruyter.

  • Hudson, R. A. (1984). Word grammar. Oxford: Blackwell.

  • Hudson, R. A. (1990). English word grammar. Cambridge: Blackwell.

  • Husain, S., Mannem, P., Ambati, B., & Gadde, P. (2010). The ICON-2010 tools contest on Indian language dependency parsing. In Proceedings of ICON-2010 tools contest on Indian language dependency parsing, Kharagpur, India, pp. 1–8.

  • Järvinen, T., & Tapanainen, P. (1998). Towards an implementable dependency grammar. In Proceedings of the workshop on processing dependency-based grammars (ACL-COLING), Montreal, Canada, pp. 1–10.

  • Kakkonen, T. (2005). Dependency treebanks: Methods, annotation schemes and tools. In Proceedings of NODALIDA 2005, Joensuu, Finland, pp. 94–104.

  • Karlsson, F. (1990). Constraint grammar as a framework for parsing running text. In Proceedings of COLING 1990, Helsinki, Finland.

  • Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (1995). Constraint grammar: A language-independent system for parsing unrestricted text. Berlin: Mouton de Gruyter.

  • Kordoni, V., & Zhang, Y. (2009). Annotating wall street journal texts using a hand-crafted deep linguistic grammar. In Proceedings of LAW III, Suntec, Singapore.

  • Kromann, M. T. (2003). The Danish Dependency Treebank and the DTAG treebank tool. In Proceedings of TLT-2003, Växjö, pp. 217–220.

  • Kübler, S., McDonald, R., & Nivre, J. (2009). Dependency Parsing. In G. Hirst (Ed.), Synthesis Lectures on Human Language Technologies. Los Altos: Morgan and Claypool Publishers.

  • Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.

  • Marimon, M. (2010). The Tibidabo treebank. Procesamiento del Lenguaje Natural, 45, 113–119.

  • Marimon, M. (2013). The Spanish DELPH-IN grammar. Language Resources and Evaluation, 47(2), 371–397.

  • Marimon, M., Bel, N., & Padró, L. (2014). Automatic selection of HPSG-parsed sentences for Treebank construction. Computational Linguistics, 40(3).

  • Maruyama, H. (1990). Structural disambiguation with constraint propagation. In Proceedings of ACL 1990, Pittsburgh, PA, pp. 31–38.

  • McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005, University of Michigan, USA, pp. 91–98.

  • McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., et al. (2013). Universal Dependency Annotation for Multilingual Parsing. In Proceedings of the ACL 2013, Sofia, Bulgaria, pp. 92–97.

  • Mel’čuk, I. (1988). Dependency syntax: theory and practice. New York: State University of New York Press.

  • Menzel, W., & Schröder, I. (1998). Decision procedures for dependency parsing using graded constraints. In Proceedings of the workshop on processing of dependency-based grammars (ACL-COLING), Montreal, Canada, pp. 78–87.

  • Montemagni, S., Barsotti, F., Battista, M., Calzolari, N., Corazzari, O., Lenci, A., et al. (2003). Building the Italian syntactic-semantic treebank. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora (pp. 189–210). Dordrecht: Kluwer Academic Publishers.

  • Moreno, A., Grishman, R., López, S., Sánchez, F., & Sekine, S. (2000). Treebank of Spanish and its application to parsing. In Proceedings of LREC-2000, Athens, Greece.

  • Nasr, A., & Rambow, O. (2004). A simple string-rewriting formalism for dependency grammar. In Proceedings of the workshop on recent advances in dependency grammar (COLING), Geneva, Switzerland, pp. 25–32.

  • Nilsson, J., Hall, J., & Nivre, J. (2005). MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. In Proceedings of the special session on Treebanks (NODALIDA 2005), Joensuu, Finland, pp. 119–132.

  • Nilsson, J., Nivre, J., & Hall, J. (2006). Graph transformations in data-driven dependency parsing. In Proceedings of ACL 2006, Sydney, Australia, pp. 257–264.

  • Nivre, J. (2005). Dependency grammar and dependency parsing, MSI report 05133. Technical report, Växjö University: School of Mathematics and Systems Engineering.

  • Nivre, J., Hall, J., & Nilsson, J. (2004). Memory-based dependency parsing. In Proceedings of CoNLL-2004, Boston, MA, USA, pp. 49–56.

  • Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., et al. (2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2), 95–135.

  • Oepen, S., & Carroll, J. (2000). Performance profiling for parser engineering. In D. Flickinger, S. Oepen, J.-I. Tsujii, & H. Uszkoreit (Eds.), Natural Language Engineering (Vol. 6, part 1), special issue: Efficient processing with HPSG: Methods, systems, evaluation (pp. 81–97). Cambridge: Cambridge University Press.

  • Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2002). LinGO Redwoods: A rich and dynamic treebank for HPSG. In Proceedings of TLT-2002, Sozopol, Bulgaria, pp. 139–149.

  • Oflazer, K. (2003). Dependency parsing with an extended finite-state approach. Computational Linguistics, 29, 515–544.

  • Oflazer, K., Say, B., Hakkani-Tür, D. Z., & Tür, G. (2003). Building a Turkish Treebank. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora. Dordrecht: Kluwer Academic Publishers.

  • Padró, L., & Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of LREC-2012, Istanbul, Turkey.

  • Padró, M., Ballesteros, M., Martínez, H., & Bohnet, B. (2013). Finding dependency parsing limits over a large Spanish corpus. In Proceedings of IJCNLP-2013.

  • Pollard, C., & Sag, I. A. (1987). Information-based syntax and semantics, Vol. I. Fundamentals. CSLI Lecture Notes, Stanford.

  • Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: The University of Chicago Press and CSLI Publications.

  • Samuelsson, C. (2000). A statistical theory of dependency syntax. In Proceedings of COLING-2000, Saarbrücken, Germany, pp. 684–690.

  • Seeker, W., & Kuhn, J. (2012). Making ellipses explicit in dependency conversion for a German treebank. In Proceedings of LREC-2012, Istanbul, Turkey.

  • Sgall, P., Hajičová, E., & Panevová, J. (1986). The meaning of the sentence in its pragmatic aspects. Dordrecht: Reidel.

  • Simov, K., & Osenova, P. (2005). Extending the annotation of BulTreeBank: Phase 2. In Proceedings of TLT-2005, Barcelona, Spain, pp. 173–184.

  • Smrz, O., Bielicky, V., Kourilová, I., Krácmar, J., Hajič, J., & Zemánek, P. (2008). Prague Arabic dependency treebank: A word on the million words. In Proceedings of the workshop on Arabic and local languages, Marrakech, Morocco, pp. 16–23.

  • Taulé, M., Martí, M., & Recasens, M. (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of LREC-2008, Marrakech, Morocco.

  • Tesnière, L. (1959). Éléments de syntaxe structurale. Paris: Librairie Klincksieck.

  • Toutanova, K., Manning, C. D., Flickinger, D., & Oepen, S. (2005). Stochastic HPSG parse disambiguation using the Redwoods corpus. Research on Language and Computation, 3(1), 83–105.

  • van der Beek, L., Bouma, G., Malouf, R., & van Noord, G. (2002). The Alpino dependency treebank. In Proceedings of CLIN-2001, Amsterdam, The Netherlands, pp. 8–22.

  • Vincze, V., Szauter, D., Almási, A., Móra, G., Alexin, Z., & Csirik, J. (2010). Hungarian dependency treebank. In Proceedings of LREC-2010, Valletta, Malta.

  • Vivaldi, J. (2009). Corpus and exploitation tool: IULACT and bwanaNet. In Actas del I Congreso Internacional de Lingüística de Corpus (CICL-09), Murcia, Spain, pp. 224–239.

  • Wang, W., & Harper, M. P. (2004). A statistical constraint dependency grammar (CDG) parser. In Proceedings of the workshop in incremental parsing: Bringing engineering and cognition together, pp. 42–49.

  • Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy, France, pp. 195–206.

  • Zabokrtsky, Z., Stepanek, J., Popel, M., Zeman, D., & Marecek, D. (2013). Coordination structures in dependency treebanks. In Proceedings of ACL 2013, Sofia, Bulgaria, pp. 517–527.

Acknowledgments

This work was funded by the Spanish Ministerio de Ciencia e Innovación under the Ramón y Cajal programme and by the European project METANET4U. We would like to thank Blanca Arias, Beatriz Fisas, Mercé Lorente, Carlos Morell, Silvia Vázquez and Jorge Vivaldi for their participation in the METANET4U project, and three anonymous reviewers for their suggestions and comments.

Author information

Corresponding author

Correspondence to Montserrat Marimon.

About this article

Cite this article

Marimon, M., Bel, N. Dependency structure annotation in the IULA Spanish LSP Treebank. Lang Resources & Evaluation 49, 433–454 (2015). https://doi.org/10.1007/s10579-014-9280-5

  • DOI: https://doi.org/10.1007/s10579-014-9280-5
