Abstract
This paper describes a methodology for grouping Catalan verbs according to their syntactic behavior. Our goal is to acquire a small number of basic classes with a high level of accuracy, using minimal resources. Information on syntactic class, which is expensive and slow to compile by hand, is useful for any NLP task that requires specific lexical information. We show that it is possible to acquire this kind of information using only a POS-tagged corpus. We perform two clustering experiments. The first classifies verbs as transitive, intransitive, or alternating with a se-construction. Our system achieves an average F-score of 0.84 on a task with a 0.33 baseline. The second experiment further distinguishes pure intransitives from verbs bearing a prepositional object. The baseline for this task is 0.51 and the upper bound is 0.98; the system achieves an average F-score of 0.88.
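As an illustration of the evaluation measure, the sketch below computes an average F-score over verb classes. It rests on assumptions not stated in the abstract: that each cluster has already been mapped to a class label, and that the average is an unweighted mean of per-class F-scores. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def per_class_f_scores(gold, predicted):
    """Return {class: F-score} for parallel lists of gold and predicted labels."""
    gold_totals = Counter(gold)
    pred_totals = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)

    scores = {}
    for label in gold_totals:
        tp = true_pos[label]
        # Precision: share of verbs assigned to this class that belong to it.
        precision = tp / pred_totals[label] if pred_totals[label] else 0.0
        # Recall: share of verbs of this class that were assigned to it.
        recall = tp / gold_totals[label]
        denom = precision + recall
        scores[label] = 2 * precision * recall / denom if denom else 0.0
    return scores


def average_f_score(gold, predicted):
    """Unweighted mean of the per-class F-scores (an assumed averaging scheme)."""
    scores = per_class_f_scores(gold, predicted)
    return sum(scores.values()) / len(scores)


# Hypothetical toy data: six verbs, three classes (VASE = verbs alternating with se).
gold = ["transitive", "transitive", "intransitive", "intransitive", "VASE", "VASE"]
predicted = ["transitive", "transitive", "intransitive", "VASE", "VASE", "VASE"]

if __name__ == "__main__":
    print(per_class_f_scores(gold, predicted))
    print(round(average_f_score(gold, predicted), 4))
```

In this toy case one intransitive verb is wrongly placed in the VASE class, which lowers recall for intransitives and precision for VASE while leaving the transitive class untouched.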
Abbreviations
- inf: infinitive
- fut: future tense
- OBJcli: object clitic
- VASE: verbs alternating with se
Acknowledgements
Many thanks to Toni Martí, Enric Vallduví, and all our colleagues at GLiCom for their useful comments. Special thanks are due to the Institut d’Estudis Catalans for lending us the research corpus, and to Nadjet Bouayad and Sebastian Padó for a critical revision of a previous version of this paper. Thanks also to Tom Rozario for language revision. This work is supported by the Departament d’Universitats, Recerca i Societat de la Informació (grants 2003FI-00867 and 2001FI-00582), and by the Fundación Caja Madrid.
Cite this article
Mayol, L., Boleda, G. & Badia, T. Automatic acquisition of syntactic verb classes with basic resources. Lang Resources & Evaluation 39, 295–312 (2005). https://doi.org/10.1007/s10579-006-9000-x
DOI: https://doi.org/10.1007/s10579-006-9000-x