RuBQ: A Russian Dataset for Question Answering over Wikidata

Korablinov, Vladislav; Braslavski, Pavel

doi:10.1007/978-3-030-62466-8_7

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12507)

Included in the following conference series:

International Semantic Web Conference

Abstract

The paper presents RuBQ, the first Russian knowledge base question answering (KBQA) dataset. The high-quality dataset consists of 1,500 Russian questions of varying complexity, their English machine translations, SPARQL queries to Wikidata, reference answers, as well as a Wikidata sample of triples containing entities with Russian labels. The dataset creation started with a large collection of question-answer pairs from online quizzes. The data underwent automatic filtering, crowd-assisted entity linking, automatic generation of SPARQL queries, and their subsequent in-house verification.

The freely available dataset will be of interest to a wide community of researchers and practitioners in the areas of Semantic Web, NLP, and IR, especially those working on multilingual question answering. The proposed dataset generation pipeline proved to be efficient and can be employed in other data annotation projects.
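
As a rough illustration of the "automatic generation of SPARQL queries" step mentioned in the abstract, the sketch below builds a one-hop Wikidata query from an already linked question entity and a property. The function name `one_hop_query`, its parameters, and the single-triple template are assumptions made for this example; the abstract does not specify how the authors generate their queries.

```python
import re

# Wikidata item IDs look like Q42, property IDs like P19.
_ITEM_ID = re.compile(r"^Q[1-9][0-9]*$")
_PROPERTY_ID = re.compile(r"^P[1-9][0-9]*$")


def one_hop_query(subject_id: str, property_id: str, limit: int = 10) -> str:
    """Build a one-hop SPARQL query (hypothetical helper, not the authors' code).

    The query asks which values `subject_id` has for `property_id`
    using Wikidata's "truthy" direct-property prefix `wdt:`.
    """
    if not _ITEM_ID.match(subject_id):
        raise ValueError(f"not a Wikidata item ID: {subject_id!r}")
    if not _PROPERTY_ID.match(property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n"
        f"SELECT ?answer WHERE {{ wd:{subject_id} wdt:{property_id} ?answer }} "
        f"LIMIT {limit}"
    )


# Example: place of birth (P19) of Douglas Adams (Q42).
print(one_hop_query("Q42", "P19"))
```

The resulting string could be sent to any SPARQL endpoint that serves Wikidata; validating the IDs up front keeps malformed identifiers out of the query text.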

V. Korablinov—Work done as an intern at JetBrains Research.

Notes

1. https://en.wikipedia.org/wiki/List_of_languages_by_total_number_of_speakers.
2. See overview of previous QALD datasets in [34].
3. We manually verified all the 558 Russian questions in the QALD-9 dataset – only two of them happen to be grammatical.
4. http://baza-otvetov.ru, http://viquiz.ru, and others.
5. Hereafter English examples are translations from original Russian questions and answers.
6. https://dumps.wikimedia.org/other/pageviews/.
7. https://toloka.ai/.
8. We examined the sample and found that only 12 questions have distances longer than two between question and answer entities in the Wikidata graph.
9. https://translate.yandex.com/.
10. https://zenodo.org/record/3751761; the project’s page on GitHub points here.
11. Details about Wikidata statement types can be found here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Statement_types.
12. http://docs.deeppavlov.ai/en/master/features/models/kbqa.html. The results reported below are as of April 2020; a newer model was released in June 2020.
13. https://qanswer-frontend.univ-st-etienne.fr/.
14. https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en.
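
The distance between question and answer entities mentioned in note 8 can be pictured as a shortest-path length over knowledge-graph triples. The sketch below is a minimal breadth-first search over a toy triple list; the function name `graph_distance`, the placeholder identifiers, and the choice to ignore edge direction are assumptions for illustration, not details taken from the paper.

```python
from collections import defaultdict, deque


def graph_distance(triples, source, target):
    """Shortest path length between two entities, or None if unreachable.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Edges are treated as undirected (an assumption of this sketch).
    """
    neighbours = defaultdict(set)
    for subj, _pred, obj in triples:
        neighbours[subj].add(obj)
        neighbours[obj].add(subj)

    if source == target:
        return 0

    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in neighbours[node]:
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


# Toy graph with placeholder identifiers (not real Wikidata IDs).
toy_triples = [("A", "p1", "B"), ("B", "p2", "C"), ("C", "p3", "D")]
print(graph_distance(toy_triples, "A", "C"))
```

With distances computed this way, questions whose question and answer entities are more than two hops apart could be counted by filtering on `graph_distance(...) > 2`.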

References

1. Artetxe, M., Ruder, S., Yogatama, D.: On the cross-lingual transferability of monolingual representations. arXiv preprint arXiv:1910.11856 (2019)
2. Bao, J., Duan, N., Yan, Z., Zhou, M., Zhao, T.: Constraint-based question answering with knowledge graph. In: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2503–2514 (2016)
3. Berant, J., Chou, A., Frostig, R., Liang, P.: Semantic parsing on freebase from question-answer pairs. In: Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1533–1544 (2013)
4. Bollacker, K., Evans, C., Paritosh, P., Sturge, T., Taylor, J.: Freebase: a collaboratively created graph database for structuring human knowledge. In: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 1247–1250 (2008)
5. Bordes, A., Usunier, N., Chopra, S., Weston, J.: Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075 (2015)
6. Burtsev, M., et al.: Deeppavlov: Open-source library for dialogue systems. In: Proceedings of ACL 2018, System Demonstrations, pp. 122–127 (2018)
7. Cai, Q., Yates, A.: Large-scale semantic parsing via schema matching and lexicon extension. In: Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pp. 423–433 (2013)
8. Clark, J.H., et al.: TyDi QA: a benchmark for information-seeking question answering in typologically diverse languages. arXiv preprint arXiv:2003.05002 (2020)
9. Diefenbach, D., Both, A., Singh, K., Maret, P.: Towards a question answering system over the semantic web. arXiv preprint arXiv:1803.00832 (2018)
10. Diefenbach, D., Giménez-García, J., Both, A., Singh, K., Maret, P.: QAnswer KG: designing a portable question answering system over RDF data. In: Hart, A., et al. (eds.) ESWC 2020. LNCS, vol. 12123, pp. 429–445. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-49461-2_25
11. Diefenbach, D., Tanon, T.P., Singh, K.D., Maret, P.: Question answering benchmarks for wikidata. In: ISWC (Posters & Demonstrations) (2017)
12. Duan, N.: Overview of the NLPCC 2019 shared task: open domain semantic parsing. In: Tang, J., Kan, M.-Y., Zhao, D., Li, S., Zan, H. (eds.) NLPCC 2019. LNCS (LNAI), vol. 11839, pp. 811–817. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-32236-6_74
13. Dubey, M., Banerjee, D., Abdelkawi, A., Lehmann, J.: LC-QuAD 2.0: a large dataset for complex question answering over wikidata and DBpedia. In: Ghidini, C., et al. (eds.) ISWC 2019. LNCS, vol. 11779, pp. 69–78. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-30796-7_5
14. Elsahar, H., Gravier, C., Laforest, F.: Zero-shot question generation from knowledge graphs for unseen predicates and entity types. In: NAACL, pp. 218–228 (2018)
15. Ferrucci, D., et al.: Building watson: an overview of the deepQA project. AI Mag. 31(3), 59–79 (2010)
16. Hakimov, S., Jebbara, S., Cimiano, P.: AMUSE: multilingual semantic parsing for question answering over linked data. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10587, pp. 329–346. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68288-4_20
17. Indurthi, S.R., Raghu, D., Khapra, M.M., Joshi, S.: Generating natural language question-answer pairs from a knowledge graph using a RNN based question generation model. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pp. 376–385 (2017)
18. Ipeirotis, P.G., Provost, F., Sheng, V.S., Wang, J.: Repeated labeling using multiple noisy labelers. Data Min. Knowl. Discov. 28(2), 402–441 (2014)
19. Jiang, K., Wu, D., Jiang, H.: FreebaseQA: a new factoid QA data set matching trivia-style question-answer pairs with Freebase. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 318–323 (2019)
20. Joshi, M., Choi, E., Weld, D.S., Zettlemoyer, L.: TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In: ACL, pp. 1601–1611 (2017)
21. Keysers, D., et al.: Measuring compositional generalization: a comprehensive method on realistic data. In: ICLR (2020)
22. Lehmann, J., et al.: DBpedia-a large-scale, multilingual knowledge base extracted from Wikipedia. Seman. Web 6(2), 167–195 (2015)
23. Levy, O., Seo, M., Choi, E., Zettlemoyer, L.: Zero-shot relation extraction via reading comprehension. In: CoNLL, pp. 333–342 (2017)
24. Lewis, P., Oğuz, B., Rinott, R., Riedel, S., Schwenk, H.: MLQA: evaluating cross-lingual extractive question answering. arXiv preprint arXiv:1910.07475 (2019)
25. Pellissier Tanon, T., Vrandečić, D., Schaffert, S., Steiner, T., Pintscher, L.: From freebase to wikidata: the great migration. In: Proceedings of the 25th international conference on world wide web, pp. 1419–1428 (2016)
26. Rajpurkar, P., Jia, R., Liang, P.: Know what you don’t know: unanswerable questions for SQuAD. In: ACL, pp. 784–789 (2018)
27. Rajpurkar, P., Zhang, J., Lopyrev, K., Liang, P.: SQuAD: 100,000+ questions for machine comprehension of text. In: EMNLP, pp. 2383–2392 (2016)
28. Saha, A., Pahuja, V., Khapra, M.M., Sankaranarayanan, K., Chandar, S.: Complex sequential question answering: towards learning to converse over linked question answer pairs with a knowledge graph. arXiv preprint (2018)
29. Serban, I.V., et al.: Generating factoid questions with recurrent neural networks: the 30M factoid question-answer corpus. In: ACL, pp. 588–598 (2016)
30. Su, Y., et al.: On generating characteristic-rich question sets for QA evaluation. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 562–572 (2016)
31. Talmor, A., Berant, J.: The Web as a knowledge base for answering complex questions. In: NAACL, pp. 641–651 (2018)
32. Trivedi, P., Maheshwari, G., Dubey, M., Lehmann, J.: LC-QuAD: a corpus for complex question answering over knowledge graphs. In: d’Amato, C., et al. (eds.) ISWC 2017. LNCS, vol. 10588, pp. 210–218. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68204-4_22
33. Usbeck, R., Gusmita, R.H., Ngonga Ngomo, A.-C., Saleem, M.: 9th challenge on question answering over linked data (QALD-9). In: SemDeep-4, NLIWoD4, and QALD-9 Joint Proceedings, pp. 58–64 (2018)
34. Usbeck, R., et al.: Benchmarking question answering systems. Semant. Web 10(2), 293–304 (2019)
35. Völske, M., et al.: What users ask a search engine: analyzing one billion Russian question queries. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 1571–1580 (2015)
36. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)
37. Wu, Z., Kao, B., Wu, T.H., Yin, P., Liu, Q.: PERQ: Predicting, explaining, and rectifying failed questions in KB-QA systems. In: Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 663–671 (2020)
38. Yih, W.T., Richardson, M., Meek, C., Chang, M.W., Suh, J.: The value of semantic parse labeling for knowledge base question answering. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 201–206 (2016)
39. Zhang, X., Yang, A., Li, S., Wang, Y.: Machine reading comprehension: a literature review. arXiv preprint arXiv:1907.01686 (2019)

Acknowledgments

We thank Mikhail Galkin, Svitlana Vakulenko, Daniil Sorokin, Vladimir Kovalenko, Yaroslav Golubev, and Rishiraj Saha Roy for their valuable comments and fruitful discussion of the paper draft. We also thank Pavel Bakhvalov, who helped collect the RuWikidata8M sample and contributed to the first version of the entity linking tool. We are grateful to Yandex.Toloka for their data annotation grant. PB acknowledges support by Ural Mathematical Center under agreement No. 075-02-2020-1537/1 with the Ministry of Science and Higher Education of the Russian Federation.

Author information

Authors and Affiliations

ITMO University, Saint Petersburg, Russia
Vladislav Korablinov
Ural Federal University, Yekaterinburg, Russia
Pavel Braslavski
HSE University, Saint Petersburg, Russia
Pavel Braslavski
JetBrains Research, Saint Petersburg, Russia
Pavel Braslavski

Corresponding author

Correspondence to Pavel Braslavski.

Editor information

Editors and Affiliations

University of Edinburgh, Edinburgh, UK
Jeff Z. Pan
University of Liverpool, Liverpool, UK
Valentina Tamma
University of Bari, Bari, Italy
Claudia d’Amato
University of California, Santa Barbara, Santa Barbara, CA, USA
Krzysztof Janowicz
California State University, Long Beach, Long Beach, CA, USA
Bo Fu
Vienna University of Economics and Business, Vienna, Austria
Axel Polleres
Rensselaer Polytechnic Institute, Troy, NY, USA
Oshani Seneviratne
Massachusetts Institute of Technology, Cambridge, MA, USA
Lalana Kagal

About this paper

Cite this paper

Korablinov, V., Braslavski, P. (2020). RuBQ: A Russian Dataset for Question Answering over Wikidata. In: Pan, J.Z., et al. The Semantic Web – ISWC 2020. ISWC 2020. Lecture Notes in Computer Science, vol 12507. Springer, Cham. https://doi.org/10.1007/978-3-030-62466-8_7

DOI: https://doi.org/10.1007/978-3-030-62466-8_7
Published: 01 November 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-62465-1
Online ISBN: 978-3-030-62466-8
eBook Packages: Computer Science (R0)
