Dependency structure annotation in the IULA Spanish LSP Treebank

Marimon, Montserrat; Bel, Núria

doi:10.1007/s10579-014-9280-5

Original Paper
Published: 02 September 2014

Volume 49, pages 433–454, (2015)

Language Resources and Evaluation

Abstract

This paper presents the IULA Spanish LSP Treebank, an open-source treebank of over 40,000 sentences developed in the framework of the European project METANET4U. The IULA Spanish LSP Treebank is the first technical corpus of Spanish annotated at the surface syntactic level following dependency grammar theory. We present the method used to create the resource and the linguistic annotations that the treebank provides, illustrating them with examples and comparing them with similar resources. We also provide statistics on the treebank and the evaluation results.

Notes

Institut Universitari de Lingüística Aplicada.
Language for Special Purposes.
Enhancing the European Linguistic Infrastructure (GA 270893). http://www.metanet.eu/projects/METANET4U/.
http://iula05v.upf.edu/TreebankBrowser.
http://metashare.upf.edu and http://hdl.handle.net/10230/20408. Note that, even though the dependency structure annotations are derived from an HPSG treebank, as we will explain in Sect. 3, only dependency annotations are released.
http://creativecommons.org/licenses/by/3.0/.
This corpus is accessible with a browser that provides concordance-based search functions at http://bwananet.iula.upf.edu/bwananet1a.en.htm.
To learn about the contexts that were missed by selecting the sentences at random, the annotators accessed the complete texts.
http://nlp.lsi.upc.edu/freeling/.
See http://www.ilc.cnr.it/EAGLES96/annotate/annotate.html.
The integration of the PoS tags associated with each word was done using the LKB Simple PreProcessor Protocol (SPPP; http://wiki.delph-in.net/moin/LkbSppp). Figure 3 shows how PoS tags are associated with each lexical item in the derivation tree.
http://www.delph-in.net.
Human annotation is not only more expensive and slower, but it also introduces more errors and inconsistencies, because of the difficulty and tiring nature of the task.
Note that the grammar still lacks coverage for some constructions that appeared in the corpus.
FreeLing errors could not be corrected and caused about 5 % of sentences to be rejected.
We also developed a parser ensemble approach to select the linguistic analyses produced by SRG automatically, using full agreement between the MaxEnt parse selection model and the MaltParser dependency parser (Nivre et al. 2007), which will be used in the near future to enlarge the treebank. We have performed some experiments using a set of 1,428 sentences and obtained the following results: 445 sentences were selected out of the 1,428 sentences (31.2 %), precision (number of correctly selected sentences among all the selected sentences) stood at 90.6 % (403/445), while recall (number of correctly selected sentences among all the sentences whose correct analysis was actually ranked first) was 46.6 % (403/864). See further details in Marimon et al. (2014).
Note that all instances of a given NE type (e.g. proper names, dates, ...) are assigned the same lexical entry identifier, which means that there is only one lexical entry for each type. See Marimon (2013) for further details.
Completive clauses may also function as SUBJ, as in example (1).
(1) Sería preferible que estuviesen más acordes.
    be.CONDITIONAL.3RD.SG preferable that be.SUBJUNCTIVE.PRESENT.3RD.PL more consistent
    ‘It would be preferable that they were more consistent.’
The dependency labels CONJ, COORD, and ENUM, used in coordinated constructions, and SUBJ-GAP, COMP-GAP, and MOD-GAP, for subjects, complements, and modifiers in gapping constructions, will be discussed in the following subsections.
Spanish clitic pronouns can appear either attached to the right side of the host verb, the so-called enclitics, or as independent lexical units in front of the verb, known as proclitics. Infinitives, gerunds, and non-negated imperatives have enclitics, verbs in personal forms always require proclitics, and past participles cannot have clitics. As in the Spanish treebank described by McDonald et al. (2013), enclitics are not split and only proclitics are annotated.
A similar analysis is provided in AnCora (Taulé et al. 2008), where a takes the label DO and the head of the NP takes the label “NP”, since the complements of a non-verbal head do not get a functional label in this treebank.
In passive constructions the verb has a unique argument which is the syntactic subject.
Zabokrtsky et al. (2013) describe some known problems related to coordination structure and present a taxonomy of various formal means developed for encoding it in dependency treebanks based on observations from a set of dependency treebanks for 26 languages.
In addition, evaluation results using the treebank with different parsing systems are discussed in Padró et al. (2013).
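
The percentages quoted in the note on the parser ensemble experiment can be rechecked from the raw counts given there (1,428 sentences in the set, 445 selected, 403 correctly selected, 864 correctly ranked first). The Python sketch below is only an arithmetic check; the function name and its parameter names are hypothetical and do not come from the paper or from any tool it mentions.

```python
def selection_metrics(total, selected, correct_selected, correct_ranked_first):
    """Return (selection rate, precision, recall) as percentages.

    Hypothetical helper: the definitions follow the wording of the note,
    i.e. precision = correctly selected / selected and
    recall = correctly selected / sentences correctly ranked first.
    """
    selection_rate = 100 * selected / total
    precision = 100 * correct_selected / selected
    recall = 100 * correct_selected / correct_ranked_first
    # Round to one decimal place, matching the precision used in the note.
    return round(selection_rate, 1), round(precision, 1), round(recall, 1)


# Counts taken from the note on the parser ensemble experiment.
rate, precision, recall = selection_metrics(1428, 445, 403, 864)
print(rate, precision, recall)  # prints: 31.2 90.6 46.6
```

The three values agree with the 31.2 %, 90.6 % and 46.6 % reported in the note.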

References

Abeillé, A. (Ed.). (2003). Treebanks: Building and using parsed corpora. Dordrecht: Kluwer Academic Publishers.
Aduriz, I., Aranzabe, M. J., Arriola, J. M., Atutxa, A., de Ilarraza, A. D., Garmendia, A., et al. (2003). Construction of a Basque dependency treebank. In Proceedings of TLT-2003, Växjö, Sweden, pp. 201–204.
Afonso, S., Bick, E., Haber, R., & Santos, D. (2002). Floresta sintá(c)tica: A treebank for Portuguese. In Proceedings of LREC-2002, Las Palmas de Gran Canaria, Spain, pp. 1698–1703.
Bangalore, S. (2003). Localizing dependencies and supertagging. In R. Bod, R. Scha, & K. Sima’an (Eds.), Data-oriented parsing (pp. 283–298). Chicago: CSLI Publications, University of Chicago Press.
Böhmová, A., Hajič, J., Hajičová, E., & Hladká, B. (2003). The PDT: A 3-level annotation scenario. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora (pp. 103–127). Dordrecht: Kluwer Academic Publishers.
Bosco, C., Lombardo, V., Vassallo, D., & Lesmo, L. (2000). Building a treebank for Italian: a data-driven annotation schema. In Proceedings of LREC-2000, Athens, Greece.
Branco, A., Costa, F., Silva, J., Silveira, S., Castro, S., Avelãs, M., et al. (2010). Developing a deep linguistic databank supporting a collection of treebanks: The CINTIL DeepGramBank. In Proceedings of LREC-2010, Valletta, Malta.
Brants, S., Dipper, S., Eisenberg, P., Hansen-Schirra, S., König, E., Lezius, W., et al. (2004). TIGER: Linguistic interpretation of a German corpus. In E. Hinrichs & K. Simov (Eds.), Research on language and computation (Vol. 2, pp. 597–619). Berlin: Springer.
Buchholz, S., & Marsi, E. (2006). CoNLL-X shared task on multilingual dependency parsing. In Proceedings of CoNLL-X, New York City, USA.
Cabré, M. T., Bach, C., & Vivaldi, J. (2006). 10 anys del Corpus de l’IULA. Barcelona: Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra.
Carter, D. (1997). The TreeBanker: A tool for supervised training of parsed corpora. In Proceedings of AAAI-97, Providence, Rhode Island, pp. 598–603.
Chaitanya, G., Husain, S., & Mannem, P. (2011). Empty categories in Hindi dependency treebank: Analysis and recovery. In Proceedings of LAW V, Portland, USA.
Collins, M., Hajič, J., Ramshaw, L., & Tillmann, C. (1999). A statistical parser for Czech. In Proceedings of ACL-1999, pp. 505–512.
Copestake, A. (2002). Implementing typed feature structure grammars. Stanford: CSLI Publications.
Copestake, A., Flickinger, D., Pollard, C., & Sag, I. A. (2006). Minimal recursion semantics: An introduction. Research on Language and Computation, 3(4), 281–332.
Covington, M. (2001). A fundamental algorithm for dependency parsing. In Proceedings of the 39th annual ACM southeast conference, pp. 95–102.
Dzeroski, S., Erjavec, T., Ledinek, N., Pajas, P., Zabokrtsky, Z., & Zele, A. (2006). Towards a Slovene dependency treebank. In Proceedings of LREC-2006, Genoa, Italy.
Eisner, J. M. (1996a). An empirical comparison of probability models for dependency grammar. Technical report IRCS-96-11, Institute for Research in Cognitive Science, University of Pennsylvania.
Eisner, J. M. (1996b). Three new probabilistic models for dependency parsing: An exploration. In Proceedings of COLING 1996, Copenhagen, Denmark, pp. 340–345.
Eisner, J. M. (2000). Bilexical grammars and their cubic-time parsing algorithms. In H. Bunt & A. Nijholt (Eds.), Advances in probabilistic and other parsing technologies (pp. 29–62). Dordrecht: Kluwer Academic Publishers.
Flickinger, D., Kordoni, V., Zhang, Y., Branco, A., Simov, K., Osenova, P., et al. (2012). ParDeepBank: Multiple parallel deep treebanking. In Proceedings of TLT-2012, Lisbon, Portugal, pp. 97–108.
Garside, R., Leech, G., & Váradi, T. (1992). Lancaster parsed corpus. A machine-readable syntactically analyzed corpus of 144,000 words, available for distribution through ICAME. Bergen: The Norwegian Computing Centre for the Humanities.
Harper, M. P., & Helzerman, R. A. (1995). Extensions to constraint dependency parsing for spoken language processing. Computer Speech and Language, 9, 187–234.
Hashimoto, C., Bond, F., & Siegel, M. (2007). Semi-automatic documentation of an implemented linguistic grammar augmented with a treebank. Language Resources and Evaluation (Special issue on Asian language technology), 42(2), 117–126.
Hellwig, P. (1986). Dependency unification grammar. In Proceedings of COLING 1986, Bonn, Germany, pp. 195–198.
Hellwig, P. (2003). Dependency unification grammar. In V. Agel, L. M. Eichinger, H. W. Eroms, P. Hellwig, H. J. Heringer, & H. Lobin (Eds.), Dependency and valency (pp. 593–635). Berlin: Walter de Gruyter.
Hudson, R. A. (1984). Word grammar. Oxford: Blackwell.
Hudson, R. A. (1990). English word grammar. Cambridge: Blackwell.
Husain, S., Mannem, P., Ambati, B., & Gadde, P. (2010). The ICON-2010 tools contest on Indian language dependency parsing. In Proceedings of ICON-2010 tools contest on Indian language dependency parsing, Kharagpur, India, pp. 1–8.
Järvinen, T., & Tapanainen, P. (1998). Towards an implementable dependency grammar. In Proceedings of the workshop on processing dependency-based grammars (ACL-COLING), Montreal, Canada, pp. 1–10.
Kakkonen, T. (2005). Dependency treebanks: Methods, annotation schemes and tools. In Proceedings of NODALIDA 2005, Joensuu, Finland, pp. 94–104.
Karlsson, F. (1990). Constraint grammar as a framework for parsing running text. In Proceedings of COLING 1990, Helsinki, Finland.
Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (1995). Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text. Berlin: Mouton de Gruyter.
Kordoni, V., & Zhang, Y. (2009). Annotating wall street journal texts using a hand-crafted deep linguistic grammar. In Proceedings of LAW III, Suntec, Singapore.
Kromann, M. T. (2003). The Danish Dependency Treebank and the DTAG treebank tool. In Proceedings of TLT-2003, Växjö, pp. 217–220.
Kübler, S., McDonald, R., & Nivre, J. (2009). Dependency Parsing. In G. Hirst (Ed.), Synthesis Lectures on Human Language Technologies. Los Altos: Morgan and Claypool Publishers.
Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19, 313–330.
Marimon, M. (2010). The Tibidabo treebank. Procesamiento del Lenguaje Natural, 45, 113–119.
Marimon, M. (2013). The Spanish DELPH-IN grammar. Language Resources and Evaluation, 47(2), 371–397.
Marimon, M., Bel, N., & Padró, L. (2014). Automatic selection of HPSG-parsed sentences for Treebank construction. Computational Linguistics, 40(3).
Maruyama, H. (1990). Structural disambiguation with constraint propagation. In Proceedings of ACL 1990, Pittsburgh, PA, pp. 31–38.
McDonald, R., Crammer, K., & Pereira, F. (2005). Online large-margin training of dependency parsers. In Proceedings of ACL 2005, University of Michigan, USA, pp. 91–98.
McDonald, R., Nivre, J., Quirmbach-Brundage, Y., Goldberg, Y., Das, D., Ganchev, K., et al. (2013). Universal dependency annotation for multilingual parsing. In Proceedings of ACL 2013, Sofia, Bulgaria, pp. 92–97.
Mel’čuk, I. (1988). Dependency syntax: theory and practice. New York: State University of New York Press.
Menzel, W., & Schröder, I. (1998). Decision procedures for dependency parsing using graded constraints. In Proceedings of the workshop on processing of dependency-based grammars (ACL-COLING), Montreal, Canada, pp. 78–87.
Montemagni, S., Barsotti, F., Battista, M., Calzolari, N., Corazzari, O., Lenci, A., et al. (2003). Building the Italian syntactic-semantic treebank. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora (pp. 189–210). Dordrecht: Kluwer Academic Publishers.
Moreno, A., Grishman, R., López, S., Sánchez, F., & Sekine, S. (2000). Treebank of Spanish and its application to parsing. In Proceedings of LREC-2000, Athens, Greece.
Nasr, A., & Rambow, O. (2004). A simple string-rewriting formalism for dependency grammar. In Proceedings of the workshop on recent advances in dependency grammar (COLING), Geneva, Switzerland, pp. 25–32.
Nilsson, J., Hall, J., & Nivre, J. (2005). MAMBA meets TIGER: Reconstructing a Swedish treebank from antiquity. In Proceedings of the special session on Treebanks (NODALIDA 2005), Joensuu, Finland, pp. 119–132.
Nilsson, J., Nivre, J., & Hall, J. (2006). Graph transformations in data-driven dependency parsing. In Proceedings of ACL 2006, Sydney, Australia, pp. 257–264.
Nivre, J. (2005). Dependency grammar and dependency parsing, MSI report 05133. Technical report, Växjö University: School of Mathematics and Systems Engineering.
Nivre, J., Hall, J., & Nilsson, J. (2004). Memory-based dependency parsing. In Proceedings of CoNLL-2004, Boston, MA, USA, pp. 49–56.
Nivre, J., Hall, J., Nilsson, J., Chanev, A., Eryigit, G., Kübler, S., et al. (2007). MaltParser: A language-independent system for data-driven dependency parsing. Natural Language Engineering, 13(2), 95–135.
Oepen, S., & Carroll, J. (2000). Performance profiling for parser engineering. In D. Flickinger, S. Oepen, J.-I. Tsujii, & H. Uszkoreit (Eds.), Natural Language Engineering (Vol. 6, part 1), special issue: Efficient processing with HPSG: Methods, systems, evaluation (pp. 81–97). Cambridge: Cambridge University Press.
Oepen, S., Flickinger, D., Toutanova, K., & Manning, C. D. (2002). LinGO Redwoods: A rich and dynamic treebank for HPSG. In Proceedings of TLT-2002, Sozopol, Bulgaria, pp. 139–149.
Oflazer, K. (2003). Dependency parsing with an extended finite-state approach. Computational Linguistics, 29, 515–544.
Oflazer, K., Say, B., Hakkani-Tür, D. Z., & Tür, G. (2003). Building a Turkish Treebank. In A. Abeillé (Ed.), Treebanks: Building and using parsed corpora. Dordrecht: Kluwer Academic Publishers.
Padró, L., & Stanilovsky, E. (2012). FreeLing 3.0: Towards wider multilinguality. In Proceedings of LREC-2012, Istanbul, Turkey.
Padró, M., Ballesteros, M., Martínez, H., & Bohnet, B. (2013). Finding dependency parsing limits over a large Spanish corpus. In Proceedings of IJCNLP-2013.
Pollard, C., & Sag, I. A. (1987). Information-based syntax and semantics, Vol. I. Fundamentals. CSLI Lecture Notes, Stanford.
Pollard, C., & Sag, I. A. (1994). Head-driven phrase structure grammar. Chicago: The University of Chicago Press and CSLI Publications.
Samuelsson, C. (2000). A statistical theory of dependency syntax. In Proceedings of COLING-2000, Saarbrücken, Germany, pp. 684–690.
Seeker, W., & Kuhn, J. (2012). Making ellipses explicit in dependency conversion for a German treebank. In Proceedings of LREC-2012, Istanbul, Turkey.
Sgall, P., Hajičová, E., & Panevová, J. (1986). The meaning of the sentence in its pragmatic aspects. Dordrecht: Reidel.
Simov, K., & Osenova, P. (2005). Extending the annotation of BulTreeBank: Phase 2. In Proceedings of TLT-2005, Barcelona, Spain, pp. 173–184.
Smrz, O., Bielicky, V., Kourilová, I., Krácmar, J., Hajič, J., & Zemánek, P. (2008). Prague Arabic dependency treebank: A word on the million words. In Proceedings of the workshop on Arabic and local languages, Marrakech, Morocco, pp. 16–23.
Taulé, M., Martí, M., & Recasens, M. (2008). AnCora: Multilevel Annotated Corpora for Catalan and Spanish. In Proceedings of LREC-2008, Marrakech, Morocco.
Tesnière, L. (1959). Eléments de syntaxe structurale. Paris: Librairie Klincksieck.
Toutanova, K., Manning, C. D., Flickinger, D., & Oepen, S. (2005). Stochastic HPSG parse disambiguation using the Redwoods corpus. Research on Language and Computation, 3(1), 83–105.
van der Beek, L., Bouma, G., Malouf, R., & van Noord, G. (2002). The Alpino dependency treebank. In Proceedings of CLIN-2001, Amsterdam, The Netherlands, pp. 8–22.
Vincze, V., Szauter, D., Almási, A., Móra, G., Alexin, Z., & Csirik, J. (2010). Hungarian dependency treebank. In Proceedings of LREC-2010, Valletta, Malta.
Vivaldi, J. (2009). Corpus and exploitation tool: IULACT and bwanaNet. In Actas del I Congreso Internacional de Lingüística de Corpus (CICL-09), Murcia, Spain, pp. 224–239.
Wang, W., & Harper, M. P. (2004). A statistical constraint dependency grammar (CDG) parser. In Proceedings of the workshop in incremental parsing: Bringing engineering and cognition together, pp. 42–49.
Yamada, H., & Matsumoto, Y. (2003). Statistical dependency analysis with support vector machines. In Proceedings of IWPT 2003, Nancy, France, pp. 195–206.
Zabokrtsky, Z., Stepanek, J., Popel, M., Zeman, D., & Marecek, D. (2013). Coordination structures in dependency treebanks. In Proceedings of ACL 2013, Sofia, Bulgaria, pp. 517–527.

Acknowledgments

This work was funded by the Spanish Ministerio de Ciencia e Innovación under the Ramón y Cajal programme and by the European project METANET4U. We would like to thank Blanca Arias, Beatriz Fisas, Mercé Lorente, Carlos Morell, Silvia Vázquez and Jorge Vivaldi for their participation in the METANET4U project, and three anonymous reviewers for their suggestions and comments.

Author information

Authors and Affiliations

Universitat de Barcelona, Gran Via de les Corts Catalanes 585, 08007, Barcelona, Spain
Montserrat Marimon
Universitat Pompeu Fabra, Roc Boronat 138, 08018, Barcelona, Spain
Núria Bel

Corresponding author

Correspondence to Montserrat Marimon.

About this article

Cite this article

Marimon, M., Bel, N. Dependency structure annotation in the IULA Spanish LSP Treebank. Lang Resources & Evaluation 49, 433–454 (2015). https://doi.org/10.1007/s10579-014-9280-5

Issue Date: June 2015
DOI: https://doi.org/10.1007/s10579-014-9280-5
