A Formal and Empirical Study of Unsupervised Signal Combination for Textual Similarity Tasks

  • Enrique Amigó
  • Fernando Giner
  • Julio Gonzalo
  • Felisa Verdejo
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10193)


We present an in-depth formal and empirical comparison of unsupervised signal combination approaches in the context of tasks based on textual similarity. Our formal study introduces the concept of Similarity Information Quantity and proves that the most salient combination methods are all estimations of Similarity Information Quantity under different statistical assumptions that simplify the computation. We also prove a Minimal Voting Performance theorem stating that, under certain plausible conditions, estimations of Similarity Information Quantity should at least match the performance of the best measure in the set. This explains, at least partially, why unsupervised combination methods perform robustly. Our empirical analysis compares a wide range of unsupervised combination methods in six different Information Access tasks based on textual similarity: Document Retrieval and Clustering, Textual Entailment, Semantic Textual Similarity, and the automatic evaluation of Machine Translation and Summarization systems. Empirical results on all datasets corroborate the results of the formal analysis and help establish recommendations on which combination method to use, depending on the nature of the set of measures to be combined.
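The abstract does not spell out the combination methods it compares. As a rough illustration of what combining similarity measures without supervision can look like, the sketch below uses a Borda-style rank aggregation, one common voting scheme. The function name `borda_combine`, the measure names, and the scores are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def borda_combine(scores_by_measure: Dict[str, List[float]]) -> List[float]:
    """Combine similarity measures by averaging Borda points per item.

    Every measure scores the same list of candidate text pairs. No labels
    or training data are used, so the combination is unsupervised.
    """
    measures = list(scores_by_measure.values())
    if not measures:
        raise ValueError("at least one measure is required")
    n_items = len(measures[0])
    if any(len(m) != n_items for m in measures):
        raise ValueError("all measures must score the same items")

    totals = [0.0] * n_items
    for scores in measures:
        for i, s in enumerate(scores):
            # Borda points: how many items this measure scores strictly lower.
            totals[i] += sum(1 for other in scores if other < s)
    # Average the points over the measures.
    return [t / len(measures) for t in totals]


# Hypothetical scores of three measures over the same three text pairs.
scores = {
    "word_overlap": [0.9, 0.2, 0.5],
    "char_ngram": [0.7, 0.1, 0.8],
    "length_ratio": [0.6, 0.3, 0.4],
}
combined = borda_combine(scores)
# Indices of the text pairs, from most to least similar.
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
```

Because only the ranks produced by each measure are used, measures on different scales can be combined without normalization. This is a design choice of the sketch, not a claim about the methods studied in the paper.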



This research was supported by the Spanish Ministry of Science and Innovation (VoxPopuli Project, TIN2013-47090-C3-1-P and Vemodalen, TIN2015-71785-R).



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Enrique Amigó (1)
  • Fernando Giner (1)
  • Julio Gonzalo (1)
  • Felisa Verdejo (1)
  1. NLP & IR Group at UNED, Madrid, Spain
