Abstract
We present the Stratified MAchine Reading Test (SMART), a data set for Chinese in which each question is assigned a “level” that reflects the type of reasoning needed to answer it. The data set consists of close to 40K question-answer pairs, and its stratified design allows machine reading researchers to quickly focus on the areas that pose the greatest challenge for a machine comprehension system. We further establish a baseline for future research with BERT, and present results showing that the levels we have designed correspond well with the difficulty BERT experiences in answering these questions, as reflected by lower accuracy at higher levels. We have also collected human answers to the questions in the test portion of the data set, and show that humans and the machine face different challenges when answering these questions. This means that even though the machine is approaching human-level performance on this task, humans and the machine perform it through very different mechanisms.
We would like to thank the students from Ludong University, particularly Liang Jian, Xu Yuanyuan, and Shang Guofeng, and the students from Nanjing Normal University, particularly Liu Han, Cao Ziyan, and Mao Xuefen, for their assistance with data preparation. The second author would like to acknowledge the support from a National Language Committee project (YB135-23) and a Jiangsu Higher Institutions’ Excellent Innovative Team for Philosophy and Social Sciences project (2017STD006). The third author would like to acknowledge the support of a National Language Committee “13th Five-Year” Research Plan project (ZDI135-22).
Notes
- 1.
See the leaderboard at https://rajpurkar.github.io/SQuAD-explorer/. On SQuAD 1.0, a number of systems have surpassed human performance, and on SQuAD 2.0, state-of-the-art systems are approaching human performance.
- 2.
Data will be made available here: https://www.cs.brandeis.edu/~clp/smart.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Yao, J., Feng, M., Feng, H., Wang, Z., Zhang, Y., Xue, N. (2019). SMART: A Stratified Machine Reading Test. In: Tang, J., Kan, MY., Zhao, D., Li, S., Zan, H. (eds) Natural Language Processing and Chinese Computing. NLPCC 2019. Lecture Notes in Computer Science, vol 11838. Springer, Cham. https://doi.org/10.1007/978-3-030-32233-5_6
DOI: https://doi.org/10.1007/978-3-030-32233-5_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-32232-8
Online ISBN: 978-3-030-32233-5
eBook Packages: Computer Science, Computer Science (R0)