Abstract
This paper presents the work towards creating PoQuAD, a dataset for training automatic question answering models in Polish. It argues that native data is vital for training accurate Question Answering systems. PoQuAD broadly follows the methodology of SQuAD 2.0 (including impossible questions), but departs from it in two respects. The first is a reduced annotation density, which broadens the range of topics included. The second is the inclusion of a generative answer layer to better suit the needs of a morphologically rich language. PoQuAD is a work in progress and so far consists of over 29,000 question-answer pairs with contexts extracted from Polish Wikipedia. The planned size of the dataset is over 50,000 such entries. The paper describes the annotation process and the guidelines given to annotators to ensure the quality of the data. The collected data is analysed to shed light on its linguistic properties and on the difficulty of the task.
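To make the two answer layers concrete, here is a minimal sketch of what a single entry could look like in a SQuAD 2.0-like layout. The class name `QAEntry`, the helper `is_consistent`, and all field names (including `generative_answer`) are assumptions made for illustration, not the dataset's actual schema, and the Polish example sentence is invented. The sketch also simplifies by giving impossible questions no answer at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAEntry:
    """One question-answer pair in a SQuAD 2.0-like layout (hypothetical field names)."""
    context: str
    question: str
    is_impossible: bool
    answer_text: Optional[str] = None        # extractive span copied from the context
    answer_start: Optional[int] = None       # character offset of the span in the context
    generative_answer: Optional[str] = None  # free-form answer, inflected to fit the question


def is_consistent(entry: QAEntry) -> bool:
    """Check that the extractive span matches the context, or that no answer is given."""
    if entry.is_impossible:
        # Simplifying assumption: impossible questions carry no answer of either kind.
        return entry.answer_text is None and entry.generative_answer is None
    if entry.answer_text is None or entry.answer_start is None:
        return False
    end = entry.answer_start + len(entry.answer_text)
    return entry.context[entry.answer_start:end] == entry.answer_text


# Invented example: the span is in the nominative case as it appears in the
# context, while the generative answer takes the accusative the question requires.
context = "Autorem Lalki jest Bolesław Prus."
example = QAEntry(
    context=context,
    question="Kogo uważa się za autora Lalki?",
    is_impossible=False,
    answer_text="Bolesław Prus",
    answer_start=context.index("Bolesław Prus"),
    generative_answer="Bolesława Prusa",
)
```

The example shows why a purely extractive span can read ungrammatically as an answer in an inflected language: the span must be copied verbatim from the context, while the generative layer is free to change the case ending.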
Notes
- 1.
- 2.
The repository at https://github.com/ipipan/poquad will be continually updated with new data. It is licensed under the GNU GPL 3.0 license.
Acknowledgements
This work was supported by the European Regional Development Fund as a part of the 2014–2020 Smart Growth Operational Programme: (1) Intelligent travel search system based on natural language understanding algorithms, project no. POIR.01.01.01–00-0798/19; (2) CLARIN — Common Language Resources and Technology Infrastructure, project no. POIR.04.02.00-00C002/19.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Tuora, R., Zawadzka-Paluektau, N., Klamra, C., Zwierzchowska, A., Kobyliński, Ł. (2022). Towards a Polish Question Answering Dataset (PoQuAD). In: Tseng, YH., Katsurai, M., Nguyen, H.N. (eds) From Born-Physical to Born-Virtual: Augmenting Intelligence in Digital Libraries. ICADL 2022. Lecture Notes in Computer Science, vol 13636. Springer, Cham. https://doi.org/10.1007/978-3-031-21756-2_16
DOI: https://doi.org/10.1007/978-3-031-21756-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-21755-5
Online ISBN: 978-3-031-21756-2
eBook Packages: Computer Science, Computer Science (R0)