Abstract
Recent studies have highlighted a pressing issue in natural language understanding (NLU) and reasoning: many of the datasets used in this domain contain subtle statistical cues. Sophisticated models can exploit these often unnoticed patterns, which leads to a potentially misleading overestimation of their genuine capabilities. Although the existence of such cues has been noted, the existing literature lacks a precise and systematic way to identify them. To address this gap, we present a novel lightweight framework that detects these hidden biases in multiple-choice NLU datasets and rigorously evaluates the robustness of models trained on them. By exposing these biases and assessing model robustness, we aim to support more genuine and transparent progress in NLU research.
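To make the notion of a "statistical cue" concrete, the following is a minimal sketch and not the framework proposed in the paper, whose details are not given in the abstract. It assumes a cue can be approximated by a token that appears in correct answer options far more or less often than chance would predict. The function names, the `min_count` threshold, and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def cue_statistics(examples):
    """Count, per token, how many options contain it and how many of those are correct.

    `examples` is a list of (options, label) pairs: `options` is a list of
    answer strings and `label` is the index of the correct option.
    Returns a dict mapping token -> (correct_count, total_count).
    """
    appears = Counter()
    correct = Counter()
    for options, label in examples:
        for i, option in enumerate(options):
            # Use a set so a token is counted once per option.
            for token in set(option.lower().split()):
                appears[token] += 1
                if i == label:
                    correct[token] += 1
    return {tok: (correct[tok], appears[tok]) for tok in appears}


def skewed_tokens(examples, num_options, min_count=2):
    """Rank tokens by how far their share of correct options is from chance."""
    chance = 1.0 / num_options
    rows = []
    for tok, (c, n) in cue_statistics(examples).items():
        if n >= min_count:  # hypothetical threshold to skip very rare tokens
            rows.append((tok, c / n, n))
    return sorted(rows, key=lambda r: abs(r[1] - chance), reverse=True)


# Toy two-option dataset (invented for illustration): negation only
# ever appears in the wrong option.
toy_examples = [
    (["he did not go", "he went home"], 1),
    (["she was happy", "she did not smile"], 0),
    (["they never left", "they stayed"], 1),
]

for token, share, count in skewed_tokens(toy_examples, num_options=2):
    print(f"{token}: correct in {share:.0%} of {count} options")
```

In this toy data the token "not" never occurs in a correct option, so a model could score above chance by avoiding options that contain it without reading the question at all. A real analysis would need far more data and a significance test before calling such a token a cue.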
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Huang, S. (2024). Can You Really Reason: A Novel Framework for Assessing Natural Language Reasoning Datasets and Models. In: Luo, B., Cheng, L., Wu, ZG., Li, H., Li, C. (eds) Neural Information Processing. ICONIP 2023. Communications in Computer and Information Science, vol 1969. Springer, Singapore. https://doi.org/10.1007/978-981-99-8184-7_5
DOI: https://doi.org/10.1007/978-981-99-8184-7_5
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8183-0
Online ISBN: 978-981-99-8184-7
eBook Packages: Computer Science, Computer Science (R0)