Testing the Reasoning Power for NLI Models with Annotated Multi-perspective Entailment Dataset

  • Dong Yu
  • Lu Liu
  • Chen Yu
  • Changliang Li
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11856)


Natural language inference (NLI) is the challenging task of determining the relationship between a pair of sentences. Existing neural network-based (NN-based) models have achieved prominent success, but few of them are interpretable. In this paper, we propose a Multi-perspective Entailment Category Labeling System (METALs), consisting of three categories and ten sub-categories. We manually annotate 3,368 entailment items, and use the annotated data to analyze the recognition ability of four NN-based models at a fine-grained level. The experimental results show that all the models perform worse on commonsense reasoning than on the other entailment categories, with an accuracy gap of up to 13.22%.
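The fine-grained analysis the abstract describes amounts to grouping model predictions by annotated entailment category and comparing per-category accuracies. A minimal sketch of that evaluation step (the function name, category labels, and data format here are illustrative, not taken from the paper):

```python
from collections import defaultdict

def per_category_accuracy(items):
    """Compute accuracy separately for each entailment category.

    `items` is a list of (category, gold_label, predicted_label) triples;
    the category names and triple format are hypothetical placeholders
    for whatever annotation scheme the dataset uses.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, gold, pred in items:
        total[category] += 1
        if gold == pred:
            correct[category] += 1
    # Per-category accuracy exposes weaknesses (e.g. commonsense
    # reasoning) that a single overall accuracy number would hide.
    return {c: correct[c] / total[c] for c in total}

preds = [
    ("lexical", "entailment", "entailment"),
    ("lexical", "entailment", "neutral"),
    ("commonsense", "entailment", "neutral"),
    ("commonsense", "entailment", "entailment"),
]
print(per_category_accuracy(preds))  # {'lexical': 0.5, 'commonsense': 0.5}
```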


Keywords: Natural Language Inference · Multi-perspective Entailment Category Labeling System · Entailment categories



This work is funded by the National Key R&D Program of China, "Cloud computing and big data" key projects (2018YFB1005105).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Beijing Language and Culture University, Beijing, China
  2. Kingsoft AI Lab, Beijing, China
