
Multi-stage transfer learning with BERTology-based language models for question answering system in Vietnamese

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

With the rapid growth of information science and engineering, the large volume of textual data being generated is valuable for natural language processing and its applications. In particular, finding correct answers to natural language questions or queries costs people tremendous time and effort. When using search engines to discover information, users must manually identify the answer to a given question within a set of retrieved texts or documents. Question answering therefore relies heavily on the ability to automatically comprehend questions posed in human language and to extract meaningful answers from a single text. In recent years, question-answering systems built on machine reading comprehension techniques have become increasingly popular. High-resource languages (e.g., English and Chinese) have seen tremendous growth in question-answering methodologies based on various knowledge sources. However, powerful BERTology-based language models can only encode texts of limited length, and longer texts contain more distractor sentences that degrade QA system performance. In addition, Vietnamese uses a variety of question words within the same question type. To address these challenges, we propose ViQAS, a new question-answering system with multi-stage transfer learning that uses BERTology-based language models for a low-resource language such as Vietnamese. Our QA system integrates Vietnamese linguistic characteristics and transformer-based evidence extraction techniques into an effective contextualized language-model-based QA system. As a result, the proposed system outperforms forty retriever-reader QA configurations of our own design as well as seven state-of-the-art QA systems (DrQA, BERTserini, BERTBM25, XLMRQA, ORQA, COBERT, and NeuralQA) on three Vietnamese benchmark question-answering datasets.
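The retriever-reader pipeline that the abstract refers to can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the authors' implementation: a BM25 lexical retriever (implemented from scratch here) ranks passages against a question, and the top passage would then be handed to a reader model for answer-span extraction. The corpus, query, and parameter values are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on word characters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 ranking function."""
    tokenized = [tokenize(d) for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n_docs = len(docs)
    q_terms = tokenize(query)
    # document frequency of each query term
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

corpus = [
    "Hanoi is the capital of Vietnam.",
    "BERT is a transformer-based language model.",
    "The Mekong River flows through Vietnam.",
]
query = "What is the capital of Vietnam?"
scores = bm25_scores(query, corpus)
best = scores.index(max(scores))
print(corpus[best])  # the passage about Hanoi ranks first
```

In a full retriever-reader system, the retrieved passage would be passed together with the question to an extractive reader (e.g., a fine-tuned BERT-family model) that predicts the answer span.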


Notes

  1. A list of phrases “noun + gì/nào” is available at https://link.uit.edu.vn/Whatphrase.

  2. https://www.sbert.net/index.html.

  3. The two datasets are freely available for research purposes.

  4. https://huggingface.co/docs/transformers/index.

  5. https://colab.research.google.com/.

  6. https://pytorch.org/.

  7. To further evaluate the type of quantitative question, we classify Others in ViQuAD and ViNewsQA into two types: How many and Others.
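Notes 1 and 7 refer to surface patterns of Vietnamese question words: "noun + gì/nào" phrases mark What-questions, while "bao nhiêu" marks quantitative (How many) questions. A toy rule-based classifier sketches the idea; the keyword lists below are illustrative examples only, not the rule set used in the paper.

```python
import re

# Illustrative keyword patterns for a few Vietnamese question types
# (e.g. "bao nhiêu" ~ "how many"; "gì"/"nào" ~ "what"/"which").
# These lists are examples, not the paper's actual rules.
PATTERNS = [
    ("How many", r"\bbao nhiêu\b|\bmấy\b"),
    ("Who",      r"\bai\b"),
    ("Where",    r"\bở đâu\b|\bnơi nào\b"),
    ("When",     r"\bkhi nào\b|\bnăm nào\b|\blúc nào\b"),
    ("What",     r"\bgì\b|\bnào\b"),  # checked last: "năm nào" etc. match earlier
]

def question_type(question):
    """Return the first matching question-type label, or "Others"."""
    q = question.lower()
    for label, pattern in PATTERNS:
        if re.search(pattern, q):
            return label
    return "Others"

print(question_type("Việt Nam có bao nhiêu tỉnh?"))  # How many
print(question_type("Thủ đô của Việt Nam là gì?"))   # What
```

Ordering matters: patterns like "năm nào" (which year) must be tested before the generic "nào", otherwise every When-question would be mislabeled as What.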


Acknowledgements

This research was supported by The VNUHCM-University of Information Technology's Scientific Research Support Fund.

Author information

Correspondence to Ngan Luu-Thuy Nguyen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Van Nguyen, K., Do, P.NT., Nguyen, N.D. et al. Multi-stage transfer learning with BERTology-based language models for question answering system in vietnamese. Int. J. Mach. Learn. & Cyber. 14, 1877–1902 (2023). https://doi.org/10.1007/s13042-022-01735-z
