COVID-19 Literature Mining and Retrieval Using Text Mining Approaches

  • Original Research
SN Computer Science

Abstract

In light of the recent COVID-19 pandemic, users face growing difficulty in navigating the vast amount of Internet content to locate relevant information. In this study, we develop an information extraction mechanism that addresses users’ inquiries about COVID-19 at varying levels of depth. To this end, we use the CORD-19 dataset released by the Allen Institute for AI. The dataset comprises 200,000 scholarly articles on coronavirus research, sourced from reputable platforms such as PubMed’s PMC, WHO, and the bioRxiv and medRxiv preprint servers. In addition to the document corpus, a list of topics is provided, covering inquiries about the infection. Each topic is represented at three levels: query, question, and narrative. The query is the most basic form of the inquiry, the question is an intermediate form, and the narrative is the most detailed and elaborate form. The proposed model uses several word embedding techniques: frequency based (Bag-of-words), semantic based (Word2Vec), a hybrid method that combines frequency with semantics (TF–IDF weighted Word2Vec), and sequence cum semantic based (BERT). These techniques are used to construct vectors for the documents in the corpus and for the query, question, narrative, and combinations of them. Once the vectors have been created, cosine similarity is used to measure the similarity between document vectors and topic vectors. Compared with the frequency and semantic models, BERT retrieves more relevant documents, with 90% accuracy. The proposed hybrid model, TF–IDF weighted Word2Vec, achieves an accuracy of 87%, which is comparable to the average performance of the BERT-Base model while being computationally efficient.
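The retrieval step described in the abstract (embed topics and documents, then rank documents by cosine similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the three documents, the query, and the hand-made 3-dimensional word vectors standing in for trained Word2Vec embeddings are all hypothetical, the IDF formula is one common variant, and the BERT model is left out entirely.

```python
import math
from collections import Counter

# Toy stand-ins for CORD-19 documents and one topic query (hypothetical data).
DOCS = {
    "d1": "coronavirus transmission in hospitals",
    "d2": "vaccine trial results for coronavirus",
    "d3": "weather forecast for the weekend",
}
QUERY = "coronavirus vaccine"

# Hand-made vectors standing in for trained Word2Vec embeddings (assumption).
TOY_VECTORS = {
    "coronavirus": [1.0, 0.1, 0.0], "transmission": [0.8, 0.3, 0.0],
    "hospitals": [0.6, 0.2, 0.1], "vaccine": [0.2, 1.0, 0.0],
    "trial": [0.1, 0.9, 0.1], "results": [0.1, 0.5, 0.3],
    "weather": [0.0, 0.0, 1.0], "forecast": [0.0, 0.1, 0.9],
    "weekend": [0.1, 0.0, 0.8],
}


def tokenize(text):
    """Lowercase whitespace tokenizer; real pipelines would do more cleaning."""
    return text.lower().split()


def inverse_document_frequency(docs):
    """IDF as log(N / df) + 1, one common variant among several."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {t: math.log(len(docs) / df) + 1.0 for t, df in doc_freq.items()}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def make_bow_embedder(vocab):
    """Frequency-based representation: raw term counts over a fixed vocabulary."""
    def embed(tokens):
        counts = Counter(tokens)
        return [float(counts[t]) for t in vocab]
    return embed


def make_hybrid_embedder(vectors, idf):
    """Hybrid representation: average of word vectors weighted by TF * IDF."""
    dim = len(next(iter(vectors.values())))

    def embed(tokens):
        total, weight_sum = [0.0] * dim, 0.0
        for term, tf in Counter(tokens).items():
            if term not in vectors:  # skip out-of-vocabulary tokens
                continue
            weight = tf * idf.get(term, 1.0)
            total = [x + weight * y for x, y in zip(total, vectors[term])]
            weight_sum += weight
        return [x / weight_sum for x in total] if weight_sum else total
    return embed


def rank(query, docs, embed):
    """Return document ids ordered by decreasing cosine similarity to the query."""
    q_vec = embed(tokenize(query))
    scores = {d: cosine(q_vec, embed(tokenize(t))) for d, t in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


IDF = inverse_document_frequency(DOCS)
VOCAB = sorted(set(IDF) | set(tokenize(QUERY)))
bow_ranking = rank(QUERY, DOCS, make_bow_embedder(VOCAB))
hybrid_ranking = rank(QUERY, DOCS, make_hybrid_embedder(TOY_VECTORS, IDF))
print("Bag-of-words ranking:", bow_ranking)
print("TF-IDF weighted ranking:", hybrid_ranking)
```

In a realistic setting the toy vectors would be replaced by embeddings trained on the corpus, and the same `rank` function could be applied to the query, question, and narrative forms of a topic, or to their concatenation.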

Data Availability Statement

The data used in this work are publicly available on Kaggle: https://www.kaggle.com/competitions/trec-covid-information-retrieval

Funding

No external funding was received for this work.

Author information

Contributions

Satya Uday Sanku and Satti Thanuja Pavani contributed equally to the main research idea and its implementation; Rohit Chivukula prepared the documentation; T. Jaya Lakshmi interpreted and analysed the results.

Corresponding author

Correspondence to T. Jaya Lakshmi.

Ethics declarations

Conflict of Interest

The authors do not have any competing interests.

Research Involving Human and/or Animals

Not applicable.

Informed Consent

All authors provided consent for this publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sanku, S.U., Pavani, S.T., Lakshmi, T.J. et al. COVID-19 Literature Mining and Retrieval Using Text Mining Approaches. SN COMPUT. SCI. 5, 211 (2024). https://doi.org/10.1007/s42979-023-02550-1

  • DOI: https://doi.org/10.1007/s42979-023-02550-1
