Research on Language and Computation, Volume 6, Issue 3–4, pp 247–271

An Analysis of Human Judgements on Semantic Classification of Catalan Adjectives

  • Gemma Boleda
  • Sabine Schulte im Walde
  • Toni Badia


This article reports on a large-scale experiment for gathering human judgements with respect to a semantic classification of Catalan adjectives. The goal of our experiment was to classify 210 Catalan adjectives as basic, event-related, or object-related adjectives, allowing for multiple class assignments to account for polysemy. The experiment was directed at non-expert native speakers and administered via the Web, collecting data from 322 participants. We assess the degree of inter-annotator agreement through an innovative methodology based on observed agreement and kappa, and use weighted versions of these measures to account for partial agreement in polysemous assignments. Because the obtained scores (kappa 0.20–0.34) are too low to establish a reliably labelled dataset, we then perform a series of post-hoc analyses on the human judgements to investigate the sources of disagreement, by comparing the participants’ classifications with a classification obtained from experts. Our analysis shows that polysemous items and event-related adjectives are more problematic than other types of adjectives. Furthermore, the analysis helps to distinguish disagreement caused by the task as opposed to that caused by the experimental design, thus pointing to specific difficulties in both aspects of the research. The methodology developed for this analysis might therefore prove useful for the design of experiments for related tasks.
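The agreement measures named in the abstract can be illustrated with a minimal sketch for two annotators. This is not the authors' implementation: the abstract does not specify their weighting scheme or how they handle many annotators, so the Jaccard-overlap weight (`jaccard_weight`), the function names, and the toy judgements below are all assumptions made for illustration. Each judgement is a set of classes, so that a polysemous assignment such as basic plus event-related can receive partial credit against a single-class assignment.

```python
from collections import Counter
from itertools import product


def exact_weight(a, b):
    """Full credit only when both label sets are identical (plain kappa)."""
    return 1.0 if a == b else 0.0


def jaccard_weight(a, b):
    """Hypothetical partial-agreement weight: overlap between two label sets."""
    return len(a & b) / len(a | b)


def agreement_and_kappa(ann1, ann2, weight=exact_weight):
    """Return (observed agreement, kappa) for two annotators.

    ann1, ann2: equal-length lists of frozensets of class labels.
    weight: function giving the agreement credit (0..1) for a pair of labels.
    """
    if len(ann1) != len(ann2) or not ann1:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(ann1)

    # Observed agreement: mean credit over the items both annotators judged.
    p_o = sum(weight(a, b) for a, b in zip(ann1, ann2)) / n

    # Expected agreement: credit expected by chance if each annotator drew
    # labels independently from their own label distribution.
    m1, m2 = Counter(ann1), Counter(ann2)
    p_e = sum(
        (m1[a] / n) * (m2[b] / n) * weight(a, b) for a, b in product(m1, m2)
    )
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")

    return p_o, (p_o - p_e) / (1.0 - p_e)


# Toy judgements (invented for illustration, not data from the experiment).
B = frozenset({"basic"})
E = frozenset({"event"})
O = frozenset({"object"})
BE = frozenset({"basic", "event"})  # polysemous assignment

annotator_1 = [B, B, E, O, BE, O]
annotator_2 = [B, E, E, O, B, O]

plain = agreement_and_kappa(annotator_1, annotator_2)
weighted = agreement_and_kappa(annotator_1, annotator_2, jaccard_weight)
print("plain    p_o=%.3f kappa=%.3f" % plain)
print("weighted p_o=%.3f kappa=%.3f" % weighted)
```

In this toy example the weighted scores are higher than the plain ones because the pair (basic plus event-related, basic) counts as half agreement rather than none. A study with many participants would need a multi-annotator generalisation of these measures, which this sketch does not cover.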


Keywords: Adjectives, Catalan, Human judgements, Inter-annotator agreement, Semantic classes, Web experiment





Copyright information

© Springer Science+Business Media B.V. 2008

Authors and Affiliations

  • Gemma Boleda (1)
  • Sabine Schulte im Walde (2)
  • Toni Badia (3)

  1. Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya, Barcelona, Spain
  2. Institute for Natural Language Processing, University of Stuttgart, Stuttgart, Germany
  3. GLiCom, Fundació Barcelona Media and Universitat Pompeu Fabra, Barcelona, Spain
