Abstract
This paper describes results on how syntactic representations and parsing methodologies affect the performance of systems for parsing Italian. Italian has a rich morphology, especially with respect to verbal suffixes, which can provide a parser with useful information for making the correct choices. With respect to syntactic representation, the experiments are based on a treebank for Italian that has been released in both a dependency and a constituency formalism, each annotated at different degrees of specificity. The two paradigms are compared, and the different degrees of specificity in marking some syntactic phenomena are pointed out. Statistical parsers have been evaluated on the basis of this treebank. The results show that both the representation format and the parsing approach strongly affect performance, which in some cases is very close to, and in others drastically different from, the state of the art for English.
Notes
- 1.
The CODICECIVILE and COSTITA corpora include legal texts, and EUDIR includes declarations of the European Community from the Italian section of the JRC-Acquis Multilingual Parallel Corpus (see http://langtech.jrc.it/JRC-Acquis.html). The NEWS corpus instead includes texts from Italian newspapers, WIKIPEDIA texts from the Italian section of Wikipedia, and VED a miscellany of academic texts, journal articles and novels.
- 2.
The term token refers to all the objects annotated in the treebank, namely words, punctuation marks and null elements.
- 3.
English translations of the Italian examples are literal and so may appear awkward in English.
- 4.
According to Word Grammar, many words that traditional grammar would have classified as Adverbs or subordinating conjunctions qualify as Prepositions or Determiners.
- 5.
For instance, in Machine Translation, if the source language allows argument deletion and the target language does not, the dropped argument must be explicitly marked in the source language for the system to be able to handle the translation. A similar situation can arise when translating from Italian (a typical pro-drop language, where subject deletion is very common with tensed Verbs) into English (where the subject is always lexically realized in tensed clauses).
- 6.
The term equi refers to the missing Subject of a subordinate infinitive Verb, e.g. the Subject of the Verb “dormire” (sleep) in “Vuole dormire” ([He] wants [to] sleep).
- 7.
The projectivity constraint is maintained for TUT also in the CoNLL format.
- 8.
- 9.
- 10.
Apart from a few cases of English morphological features which do not exist (e.g. possessive ending) or do not correspond with Italian forms (e.g. comparative Adjective and Adverb).
- 11.
The inclusion of person, gender and number values in morphological tags was tested without yielding any improvement in parser performance. Investigating the effect of including these features for Italian, or for other MRLs, may be of interest for future work.
- 12.
English translation: The agreement is broken for three main motivations.
- 13.
Proper nouns are not marked in Italian in terms of number.
- 14.
In fact, in a dependency tree the relation subject marks an edge linking the verbal head with a dependent which can be distinguished from other verbal dependents only by the type of the relation.
- 15.
English translation: A right allowance is due to the owner.
- 16.
E.g. the tag PUT which represents the locative complement of the Verb “put”, or the tag DTV (dative) which is annotated in indirect objects when they are realized as prepositional phrases, i.e. not affected by the dative shift.
- 17.
The evaluation has been performed by using the MaltEval tools [31].
- 18.
This shows, however, that the test set, even though it has the same balance as TUT, is not fully representative of the treebank in terms of relations and constructions.
- 19.
This is only partially explained by the sentence length, which is lower than 40 words only in the test set, and by the smaller size of the training set for the 10-fold cross validation.
- 20.
The ten most frequent relations in the whole 1-Comp treebank (with respect to 72,149 annotated tokens) are ARG (30.3 %), RMOD (19.2 %), OBJ (4.5 %), SUBJ (3.9 %), END (3.3 %), TOP (3.2 %), COORD2ND+BASE (3.1 %), COORD+BASE (3.1 %), SEPARATOR (2.7 %), INDCOMPL (1.9 %).
- 21.
As regards the parsing of legal texts in particular, see also the Proceedings of the LREC 2012 Workshop on Semantic Processing of Legal Texts (SPLeT-2012), available at http://www.lrec-conf.org/proceedings/lrec2012/workshops/27.LREC%202012%20Workshop%20-Proceedings%20SPLeT.pdf.
- 22.
The tool is freely available from http://www.cis.upenn.edu/dbikel/software.html#comparator.
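Note 7 states that the projectivity constraint is maintained for TUT in the CoNLL format. As an illustration of what that constraint means, the following minimal sketch checks whether a dependency tree is projective, i.e. whether every token lying between a head and one of its dependents is itself dominated by that head. The function name `is_projective` and the encoding (a list of 1-based head indices with 0 for the root, as in a CoNLL-style HEAD column) are assumptions made for this example, and arcs from the artificial root are not checked; it is not the tooling used for TUT.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """Return True if the dependency tree has no non-projective arc.

    heads[i] is the 1-based index of the head of token i + 1;
    0 marks the root (hypothetical encoding chosen for this sketch).
    """
    n = len(heads)

    def dominated_by(node: int, ancestor: int) -> bool:
        # Walk up the head chain from `node` until the root (0) is reached.
        # The step counter guards against malformed input containing cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # arcs from the artificial root are skipped here
        left, right = min(head, dep), max(head, dep)
        # Every token strictly between head and dependent must descend from head.
        for between in range(left + 1, right):
            if not dominated_by(between, head):
                return False
    return True
```

For example, `is_projective([0, 1])` encodes the two-word sentence “Vuole dormire” with the second word depending on the first, while `[3, 4, 0, 3]` contains an arc from token 4 to token 2 that spans token 3 without dominating it.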
References
Alicante, A., Bosco, C., Corazza, A., Lavelli, A.: A treebank-based study on the influence of Italian word order on parsing performance. In: LREC, pp. 1985–1992 (2012)
Bosco, C.: A richer annotation schema for an Italian treebank. In: Proceedings of European Summer School on Logic Language and Information, Birmingham, UK (2000), http://www.di.unito.it/~bosco/publicat/esslli00.zip
Bosco, C.: Grammatical relation’s system in treebank annotation. In: Proceedings of Student Research Workshop of Joint ACL/EACL Meeting, Toulouse, France (2001), http://www.di.unito.it/~bosco/publicat/acl-stud-ses-01.zip
Bosco, C.: A grammatical relation system for treebank annotation, Ph.D. thesis, University of Torino (2004)
Bosco, C.: Multiple-step treebank conversion: from dependency to Penn format. In: Proceedings of Linguistic Annotation Workshop at the ACL’07 (2007)
Bosco, C.: Linguistic knowledge extraction from corpus parallel annotations. In: Proceedings of XL Congresso della Società di Linguistica Italiana, Vercelli (2009), http://www.di.unito.it/~bosco/publicat/sli06.zip
Bos, J., Bosco, C., Mazzei, A.: Converting a dependency treebank to a categorial grammar treebank for Italian. In: Proceedings of the Eighth Workshop on Treebanks and Linguistic Theories, pp. 27–38. Milan (2009)
Bosco, C., Lavelli, A.: Annotation schema oriented evaluation for parsing validation. In: Proceedings of the 9th Workshop on Treebanks and Linguistic Theories (TLT-9), pp. 19–30. Tartu, Estonia (2010)
Bosco, C., Mazzei, A., Lavelli, A.: Looking back to the Evalita constituency parsing task: 2007–2011. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds.) Evaluation of Natural Language and Speech Tools for Italian—Proceedings of EVALITA 2011, pp. 46–57 (2012)
Bosco, C., Lombardo, V.: A relation-schema for treebank annotation. In: Cappelli, A., Turini, F. (eds.) Advances in Artificial Intelligence, LNCS, vol. 2829. Springer, Berlin (2003), http://www.di.unito.it/~bosco/publicat/aiia-03.zip
Bosco, C., Lombardo, V.: Comparing linguistic information in treebank annotations. In: Proceedings of the 5th International Language Resources and Evaluation Conference (2006), http://www.di.unito.it/~bosco/publicat/lrec06.zip
Bosco, C., Lombardo, V., Lesmo, L., Vassallo, D.: Building a treebank for Italian: a data-driven annotation schema. In: Proceedings of 2nd International Conference on Language Resources and Evaluation, Athens, Greece (2000), http://www.di.unito.it/~bosco/publicat/lrec00.zip
Bosco, C., Mazzei, A., Lombardo, V.: Evalita parsing task: an analysis of the first parsing system contest for Italian. Intell. Artif. 2(IV), 30–33 (2007)
Bosco, C., Mazzei, A., Lombardo, V.: Evalita’09 parsing task: constituency parsers and the Penn format for Italian. In: Proceedings of Evalita’09 (2009)
Bosco, C., Montemagni, S., Mazzei, A., Lombardo, V., Dell’Orletta, F., Lenci, A.: Evalita’09 parsing task: comparing dependency parsers and treebanks. In: Proceedings of Evalita’09, Reggio Emilia (2009)
Bosco, C., Montemagni, S., Mazzei, A., Lombardo, V., Dell’Orletta, F., Lenci, A., Lesmo, L., Attardi, G., Simi, M., Lavelli, A., Hall, J., Nilsson, J., Nivre, J.: Comparing the influence of different treebank annotations on dependency parsing. In: Proceedings of Language Resources and Evaluation Conference, pp. 1794–1801. Malta (2010)
Cheung, J.C., Penn, G.: Topological field parsing of German. In: Proceedings of ACL-IJCNLP’09, pp. 64–72. Singapore (2009)
Collins, M., Hajic, J., Ramshaw, L., Tillmann, C.: A statistical parser of Czech. In: Proceedings of the ACL’99 (1999)
Corazza, A., Lavelli, A., Satta, G.: An information-theoretic measure to evaluate parsing difficulty across treebanks. ACM Trans. Speech Lang. Process. 9(4), 7:1–7:31 (2013). http://doi.acm.org/10.1145/2407736.2407737
Dell’Orletta, F., Marchi, S., Montemagni, S., Venturi, G.: Domain adaptation for dependency parsing at Evalita 2011. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds.) Evaluation of Natural Language and Speech Tools for Italian—Proceedings of EVALITA 2011, pp. 58–69 (2012)
Green, S., Manning, C.D.: Better Arabic parsing: Baselines, evaluations, and analysis. In: Proceedings of COLING 2010 (2010)
Hajič, J., Böhmová, A., Hajičová, E., Vidová-Hladká, B.: The Prague Dependency Treebank: a three-level annotation scenario. In: Abeillé, A. (ed.) Treebanks: Building and Using Parsed Corpora, pp. 103–127. Kluwer, Amsterdam (2000)
Hudson, R.: Word Grammar. Basil Blackwell, Oxford (1984)
Jones, B.E.M.: Exploring the role of punctuation in parsing natural text. In: Proceedings of COLING’94, pp. 421–425. Kyoto (1994)
Kübler, S., Rehbein, I., van Genabith, J.: TePaCoC a corpus for testing parser performance on complex German grammatical constructions. In: Proceedings of TLT-7, pp. 15–28. Groningen, The Netherlands (2009)
Lavelli, A., Hall, J., Nilsson, J., Nivre, J.: MaltParser at the Evalita 2009 dependency parsing task. In: Proceedings of Evalita’09, Reggio Emilia (2009)
Lesmo, L.: Use of semantic information in a syntactic dependency parser. In: Magnini, B., Cutugno, F., Falcone, M., Pianta, E. (eds.) Evaluation of Natural Language and Speech Tools for Italian—Proceedings of EVALITA 2011, pp. 13–20 (2012)
Lesmo, L.: The rule-based parser of the NLP group of the University of Torino. Intell. Artif. 2, 46–47 (2007)
Lesmo, L.: The Turin University parser at Evalita 2009. In: Proceedings of Evalita’09, Reggio Emilia (2009)
Lesmo, L., Lombardo, V., Bosco, C.: Treebank development: the TUT approach. In: Proceedings of ICON02, Mumbai, India (2002), http://www.di.unito.it/~bosco/publicat/icon02lesmo-et-al.zip
Nilsson, J., Nivre, J.: MaltEval: An evaluation and visualization tool for dependency parsing. In: Proceedings of LREC’08, pp. 161–166. Marrakech (2008)
Nivre, J., Hall, J., Nilsson, J.: MaltParser: A data-driven parser-generator for dependency parsing. In: Proceedings of LREC’06, pp. 2216–2219. Genova (2006)
Petrov, S., Klein, D.: Improved inference for unlexicalized parsing. In: Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference, pp. 404–411. Rochester, New York (April 2007). http://www.aclweb.org/anthology/N/N07/N07-1051
Rimell, L., Clark, S., Steedman, M.: Unbounded dependency recovery for parser evaluation. In: Proceedings of Empirical Methods in Natural Language Processing ’09, pp. 813–821. Singapore (2009)
Copyright information
© 2015 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Alicante, A., Bosco, C., Corazza, A., Lavelli, A. (2015). Evaluating Italian Parsing Across Syntactic Formalisms and Annotation Schemes. In: Basili, R., Bosco, C., Delmonte, R., Moschitti, A., Simi, M. (eds) Harmonization and Development of Resources and Tools for Italian Natural Language Processing within the PARLI Project. Studies in Computational Intelligence, vol 589. Springer, Cham. https://doi.org/10.1007/978-3-319-14206-7_7
DOI: https://doi.org/10.1007/978-3-319-14206-7_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14205-0
Online ISBN: 978-3-319-14206-7
eBook Packages: Engineering, Engineering (R0)