Language Resources and Evaluation, Volume 46, Issue 2, pp 219–252

Multiplicity and word sense: evaluating and learning from multiply labeled word sense annotations

  • Rebecca J. Passonneau
  • Vikas Bhardwaj
  • Ansaf Salleb-Aouissi
  • Nancy Ide
Original Paper


Supervised machine learning methods to model word sense often rely on human labelers to provide a single, ground truth label for each word in its context. We examine issues in establishing ground truth word sense labels using a fine-grained sense inventory from WordNet. Our data consist of a corpus of 1,000 sentences: 100 for each of ten moderately polysemous words. Each word was given multiple sense labels, or a multilabel, from trained and untrained annotators. The multilabels give a nuanced representation of the degree of agreement on instances. A suite of assessment metrics, such as comparisons of sense distributions across annotators, is used to analyze the sets of multilabels. Our assessment indicates that the general annotation procedure is reliable, but that words differ in how reliably annotators can assign WordNet sense labels, independent of the number of senses. We also investigate the performance of an unsupervised machine learning method to infer ground truth labels from various combinations of labels from the trained and untrained annotators. We find tentative support for the hypothesis that performance depends on the quality of the set of multilabels, independent of the number of labelers or their training.
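
As an illustration of what comparing sense distributions across annotators can look like, the following is a minimal sketch rather than the authors' procedure. It assumes Jensen-Shannon divergence as the comparison measure (the abstract does not name one), and the sense identifiers and labels are hypothetical.

```python
import math
from collections import Counter


def sense_distribution(labels, senses):
    """Relative frequency of each sense in one annotator's (non-empty) labels."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[s] / total for s in senses]


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical data: two annotators label the same six instances of one word.
senses = ["s1", "s2", "s3"]
annotator_a = ["s1", "s1", "s2", "s2", "s3", "s1"]
annotator_b = ["s1", "s2", "s2", "s2", "s3", "s3"]

p = sense_distribution(annotator_a, senses)
q = sense_distribution(annotator_b, senses)
print(f"JSD(a, b) = {js_divergence(p, q):.4f}")
```

A value near 0 means the two annotators use the senses in similar proportions; it says nothing about whether they agree on individual instances, which is why a suite of metrics is needed rather than a single one.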


Keywords: Word sense annotation, Multilabel learning, Inter-annotator reliability



This work was supported by NSF award CRI-0708952, including a supplement to fund co-author Vikas Bhardwaj as a Graduate Research Assistant for one semester. The authors thank the annotators for their excellent work and thoughtful comments on sense inventories. We thank Bob Carpenter for discussions about data from multiple annotators, and for his generous and insightful comments on drafts of the paper. Finally, we thank the anonymous reviewers who provided deep and thoughtful critiques, as well as very careful proofreading.



Copyright information

© Springer Science+Business Media B.V. 2012

Authors and Affiliations

  • Rebecca J. Passonneau (1)
  • Vikas Bhardwaj (1)
  • Ansaf Salleb-Aouissi (1)
  • Nancy Ide (2)
  1. Columbia University, New York, USA
  2. Vassar College, Poughkeepsie, USA
