SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment

  • Original Paper
  • Published in: Language Resources and Evaluation

Abstract

This paper is an extended description of SemEval-2014 Task 1, the task on the evaluation of Compositional Distributional Semantics Models on full sentences. Systems participating in the task were presented with pairs of sentences and were evaluated on their ability to predict human judgments on (1) semantic relatedness and (2) entailment. Training and testing data were subsets of the SICK (Sentences Involving Compositional Knowledge) data set. SICK was developed with the aim of providing a proper benchmark to evaluate compositional semantic systems, though task participation was open to systems based on any approach. Taking advantage of the SemEval experience, in this paper we analyze the SICK data set, in order to evaluate the extent to which it meets its design goal and to shed light on the linguistic phenomena that are still challenging for state-of-the-art computational semantic systems. Qualitative and quantitative error analyses show that many systems are quite sensitive to changes in the proportion of sentence pair types, and degrade in the presence of additional lexico-syntactic complexities which do not affect human judgments. More compositional systems seem to perform better when the task proportions are changed, but the effect needs further confirmation.
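The two evaluation measures used in the task, Pearson correlation against human relatedness judgments and accuracy on the three-way entailment labels, can be sketched as follows. This is a minimal illustration only: the sentence-pair scores and labels below are invented toy values, not SICK data, and the function names are ours.

```python
import math

def pearson_r(pred, gold):
    # Pearson correlation between system scores and human judgments.
    n = len(pred)
    mp, mg = sum(pred) / n, sum(gold) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gold))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    sg = math.sqrt(sum((g - mg) ** 2 for g in gold))
    return cov / (sp * sg)

def accuracy(pred_labels, gold_labels):
    # Fraction of entailment labels predicted correctly.
    correct = sum(p == g for p, g in zip(pred_labels, gold_labels))
    return correct / len(gold_labels)

# Toy example: relatedness scores on SICK's 1-5 scale, entailment
# labels from {ENTAILMENT, CONTRADICTION, NEUTRAL}.
system_scores = [4.2, 1.8, 3.5, 2.9]
human_scores  = [4.5, 1.5, 3.9, 2.4]
system_labels = ["ENTAILMENT", "CONTRADICTION", "NEUTRAL", "NEUTRAL"]
gold_labels   = ["ENTAILMENT", "CONTRADICTION", "NEUTRAL", "ENTAILMENT"]

print(round(pearson_r(system_scores, human_scores), 3))  # 0.975
print(accuracy(system_labels, gold_labels))              # 0.75
```

In the task itself, relatedness runs were ranked primarily by Pearson r (with Spearman correlation and MSE also reported) and entailment runs by accuracy.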

Notes

  1. The original SICK data set and all the derived versions used for the analyses carried out in this paper can be downloaded at http://clic.cimec.unitn.it/composes/sick.html.

  2. http://nlp.cs.illinois.edu/HockenmaierGroup/data.html.

  3. http://www.cs.york.ac.uk/semeval-2012/task6/index.php?id=data.

  4. Inter-rater agreement figures given in this paper for both relatedness and entailment slightly differ from those reported in Marelli et al. (2014b), due to a small bug that has been fixed.

  5. A comparable variance range is obtained by running the same simulation on the number of CONTRADICTION pairs (1424).

  6. http://alt.qcri.org/semeval2014/task1/.

  7. http://alt.qcri.org/semeval2014/task1/index.php?id=data-and-tools.

  8. ITTK’s primary run could not be evaluated due to technical problems with the submission. ITTK’s best non-primary run scored a Pearson correlation of 0.76 in the relatedness task and 78.2% accuracy in the entailment task.

  9. Although the SICK test set contains a total of 4906 sentence pairs, we could not create a larger balanced test set: each class had to be limited to 150 pairs, since the Norm-Diff cross-set class is very small, containing only 168 pairs.

References

  • Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). SemEval-2012 Task 6: A pilot on semantic textual similarity. In Proceedings of SemEval 2012: The Sixth International Workshop on Semantic Evaluation.

  • Alves, A. O., Ferrugento, A., Lorenço, M., & Rodrigues, F. (2014). ASAP: Automatic semantic alignment for phrases. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of EMNLP.

  • Beltagy, I., Roller, S., Boleda, G., Erk, K., & Mooney, R. J. (2014). UTexas: Natural language semantics using distributional semantics and probabilistic logic. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Bentivogli, L., Dagan, I., Dang, H. T., Giampiccolo, D., & Magnini, B. (2009). The fifth PASCAL recognizing textual entailment challenge. In Proceedings of the Text Analysis Conference.

  • Bestgen, Y. (2014). CECL: A new baseline and a non-compositional approach for the SICK benchmark. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Biçici, E., & Way, A. (2014). RTM-DCU: Referential translation machines for semantic similarity. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Bjerva, J., Bos, J., van der Goot, R., & Nissim, M. (2014). The meaning factory: Formal semantics for recognizing textual entailment and determining semantic similarity. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.

  • Chen, D. L., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of ACL.

  • Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognising textual entailment challenge. In J. Quiñonero-Candela, I. Dagan, B. Magnini & F. d’Alché–Buc (Eds.), Machine learning challenges. Evaluating predictive uncertainty, visual object classification, and recognising textual entailment (pp. 177–190). Heidelberg: Springer.

  • Ferrone, L., & Zanzotto, F. M. (2014). haLF: Comparing a pure CDSM approach and a standard ML system for RTE. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In Proceedings of IJCAI.

  • Grefenstette, E., & Sadrzadeh, M. (2011). Experimental support for a categorical compositional distributional model of meaning. In Proceedings of EMNLP.

  • Gupta, R., Bechara, H., El Maarouf, I., & Orasǎn, C. (2014). UoW: NLP techniques developed at the University of Wolverhampton for semantic similarity and textual entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47, 853–899.

  • Jimenez, S., Duenas, G., Baquero, J., & Gelbukh, A. (2014). UNAL-NLP: Combining soft cardinality features for semantic textual similarity, relatedness and entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Lai, A., & Hockenmaier, J. (2014). Illinois-LH: A denotational and distributional approach to semantics. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • León, S., Vilarino, D., Pinto, D., Tovar, M., & Beltrán, B. (2014). BUAP: Evaluating compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Lien, E., & Kouylekov, M. (2014). UIO-Lien: Entailment recognition using minimal recursion semantics. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Marelli, M., Bentivogli, L., Baroni, M., Bernardi, R., Menini, S., & Zamparelli, R. (2014a). SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R., & Zamparelli, R. (2014b). A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of LREC.

  • Mitchell, J., & Lapata, M. (2008). Vector-based models of semantic composition. In Proceedings of ACL.

  • Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1429.

  • Pagin, P., & Westerståhl, D. (2010). Compositionality I: Definitions and variants. Philosophy Compass, 5(3), 250–264. doi:10.1111/j.1747-9991.2009.00228.x.

  • Proisl, T., & Evert, S. (2014). SemantiKLUE: Robust semantic similarity at multiple levels using maximum weight matching. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Socher, R., Huval, B., Manning, C., & Ng, A. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of EMNLP.

  • Vo, A. N. P., Popescu, O., & Caselli, T. (2014). FBK-TR: SVM for semantic relatedness and corpus patterns for RTE. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

  • Zhao, J., Zhu, T. T., & Lan, M. (2014). ECNU: One stone two birds: Ensemble of heterogenous measures for semantic relatedness and textual entailment. In Proceedings of SemEval 2014: International Workshop on Semantic Evaluation.

Acknowledgments

We thank the creators of the ImageFlickr, MSR-Video, and SemEval-2012 STS data sets for granting us permission to use their data for the task. The University of Trento authors were supported by ERC 2011 Starting Independent Research Grant No. 283554 (COMPOSES).

Author information

Correspondence to Raffaella Bernardi.

Cite this article

Bentivogli, L., Bernardi, R., Marelli, M. et al. SICK through the SemEval glasses. Lesson learned from the evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment. Lang Resources & Evaluation 50, 95–124 (2016). https://doi.org/10.1007/s10579-015-9332-5
