
Multi-stage transfer learning with BERTology-based language models for question answering system in Vietnamese

  • Original Article
  • Published in: International Journal of Machine Learning and Cybernetics

Abstract

With the rapid growth of information science and engineering, the large volume of textual data being generated is valuable for natural language processing and its applications. In particular, finding correct answers to natural language questions or queries costs people tremendous time and effort. When using search engines to discover information, users must manually identify the answer to a given question within a set of retrieved texts or documents. Question answering therefore relies heavily on the ability to automatically comprehend questions posed in human language and to extract meaningful answers from a single text. In recent years, question-answering systems built on machine reading comprehension techniques have become increasingly popular. High-resource languages (e.g., English and Chinese) have seen tremendous growth in question-answering methodologies based on various knowledge sources. However, powerful BERTology-based language models can only encode texts of limited length, and longer texts contain more distractor sentences that degrade QA system performance. In addition, Vietnamese uses a variety of question words within the same question type. To address these challenges, we propose ViQAS, a new question-answering system with multi-stage transfer learning that uses BERTology-based language models for a low-resource language such as Vietnamese. Our QA system integrates Vietnamese linguistic characteristics and transformer-based evidence extraction techniques into an effective contextualized language-model-based QA system. As a result, the proposed system outperforms forty retriever-reader QA configurations of our own design as well as seven state-of-the-art QA systems (DrQA, BERTserini, BERTBM25, XLMRQA, ORQA, COBERT, and NeuralQA) on three Vietnamese benchmark question-answering datasets.
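The retriever-reader pipeline that the abstract refers to can be illustrated with a minimal sketch. This is a toy example under stated assumptions, not the authors' implementation: a BM25 lexical retriever (implemented from scratch here) ranks passages against a question, and the top passage would then be handed to a reader model for answer-span extraction. The corpus, query, and parameter values are made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on word characters (a crude stand-in for a real tokenizer)."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with the BM25 ranking function."""
    tokenized = [tokenize(d) for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n_docs = len(docs)
    q_terms = tokenize(query)
    # document frequency of each query term
    df = {t: sum(1 for d in tokenized if t in d) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n_docs - df[t] + 0.5) / (df[t] + 0.5))
            norm = k1 * (1 - b + b * len(doc) / avgdl)
            score += idf * tf[t] * (k1 + 1) / (tf[t] + norm)
        scores.append(score)
    return scores

corpus = [
    "Hanoi is the capital of Vietnam.",
    "BERT is a transformer-based language model.",
    "The Mekong River flows through Vietnam.",
]
query = "What is the capital of Vietnam?"
scores = bm25_scores(query, corpus)
best = scores.index(max(scores))
print(corpus[best])  # the passage about Hanoi ranks first
```

In a full retriever-reader system, the retrieved passage would be passed together with the question to an extractive reader (e.g., a fine-tuned BERT-family model) that predicts the answer span.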


Notes

  1. A list of phrases “noun + gì/nào” is available at https://link.uit.edu.vn/Whatphrase.

  2. https://www.sbert.net/index.html.

  3. The two datasets are freely available for research purposes.

  4. https://huggingface.co/docs/transformers/index.

  5. https://colab.research.google.com/.

  6. https://pytorch.org/.

  7. To further evaluate the type of quantitative question, we classify Others in ViQuAD and ViNewsQA into two types: How many and Others.
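Notes 1 and 7 refer to surface patterns of Vietnamese question words: "noun + gì/nào" phrases mark What-questions, while "bao nhiêu" marks quantitative (How many) questions. A toy rule-based classifier sketches the idea; the keyword lists below are illustrative examples only, not the rule set used in the paper.

```python
import re

# Illustrative keyword patterns for a few Vietnamese question types
# (e.g. "bao nhiêu" ~ "how many"; "gì"/"nào" ~ "what"/"which").
# These lists are examples, not the paper's actual rules.
PATTERNS = [
    ("How many", r"\bbao nhiêu\b|\bmấy\b"),
    ("Who",      r"\bai\b"),
    ("Where",    r"\bở đâu\b|\bnơi nào\b"),
    ("When",     r"\bkhi nào\b|\bnăm nào\b|\blúc nào\b"),
    ("What",     r"\bgì\b|\bnào\b"),  # checked last: "năm nào" etc. match earlier
]

def question_type(question):
    """Return the first matching question-type label, or "Others"."""
    q = question.lower()
    for label, pattern in PATTERNS:
        if re.search(pattern, q):
            return label
    return "Others"

print(question_type("Việt Nam có bao nhiêu tỉnh?"))  # How many
print(question_type("Thủ đô của Việt Nam là gì?"))   # What
```

Ordering matters: patterns like "năm nào" (which year) must be tested before the generic "nào", otherwise every When-question would be mislabeled as What.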


Acknowledgements

This research was supported by The VNUHCM-University of Information Technology's Scientific Research Support Fund.

Author information

Correspondence to Ngan Luu-Thuy Nguyen.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Van Nguyen, K., Do, P.NT., Nguyen, N.D. et al. Multi-stage transfer learning with BERTology-based language models for question answering system in vietnamese. Int. J. Mach. Learn. & Cyber. 14, 1877–1902 (2023). https://doi.org/10.1007/s13042-022-01735-z
