COVID-19 Literature Mining and Retrieval Using Text Mining Approaches

Sanku, Satya Uday; Pavani, Satti Thanuja; Lakshmi, T. Jaya; Chivukula, Rohit

doi:10.1007/s42979-023-02550-1

Original Research
Published: 17 January 2024

Volume 5, article number 211, (2024)
SN Computer Science

Satya Uday Sanku^1,3,
Satti Thanuja Pavani^1,4,
T. Jaya Lakshmi^1 (ORCID: orcid.org/0000-0003-0183-4093) &
…
Rohit Chivukula^2

Abstract

In light of the recent COVID-19 epidemic, users face growing difficulty in navigating the vast amount of Internet content to locate relevant information. In this study, we develop an information extraction mechanism that answers users' inquiries about COVID-19 at varying levels of detail. To this end, we use the CORD-19 dataset made available by the Allen Institute for AI. The dataset comprises 200,000 scholarly articles on coronavirus research, sourced from reputable platforms such as PubMed's PMC, WHO, bioRxiv, and medRxiv pre-prints. In addition to the document corpus, a list of topics is provided, covering inquiries about the infection. Each topic has three levels of representation: query, question, and narrative. The query is the most basic form of the inquiry, the question is an intermediate form, and the narrative is the most detailed and elaborate form. The proposed model uses several word embedding techniques, namely frequency-based (Bag-of-Words), semantic-based (Word2Vec), a hybrid method that combines frequency with semantics (TF–IDF weighted Word2Vec), and sequence-cum-semantic-based (BERT), to construct vectors for the documents in the corpus and for the query, question, narrative, and combinations of them. Once the vectors have been created, cosine similarity is used to measure the similarity between document vectors and topic vectors. Compared with the frequency and semantic models, BERT retrieves more relevant documents, with 90% accuracy. The proposed hybrid model, TF–IDF weighted Word2Vec, achieves an accuracy of 87%, which is comparable to the average performance of the BERT-Base model while demonstrating computational efficiency.
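The hybrid representation and the cosine-similarity ranking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the weighting formula, so the smoothed IDF used here (the form that scikit-learn's TfidfVectorizer applies by default) is an assumption, and the three-dimensional vectors in `TOY_VECTORS` are hand-made placeholders standing in for trained Word2Vec embeddings. The names `tokenize`, `idf_table`, `weighted_embedding`, and `rank` are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors; a real system would load trained
# Word2Vec embeddings instead of these hand-made placeholders.
TOY_VECTORS = {
    "coronavirus": [1.0, 0.0, 0.0],
    "origin": [0.9, 0.1, 0.0],
    "bat": [0.8, 0.2, 0.0],
    "vaccine": [0.0, 1.0, 0.0],
    "trial": [0.1, 0.9, 0.0],
    "weather": [0.0, 0.0, 1.0],
    "humidity": [0.0, 0.1, 0.9],
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def idf_table(tokenized_docs):
    """Smoothed IDF (an assumed form): log((1 + N) / (1 + df)) + 1."""
    n_docs = len(tokenized_docs)
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def weighted_embedding(tokens, word_vectors, idf, dim):
    """TF-IDF weighted average of word vectors; unknown words are skipped."""
    total = [0.0] * dim
    weight_sum = 0.0
    for term, tf in Counter(tokens).items():
        if term not in word_vectors or term not in idf:
            continue
        weight = tf * idf[term]
        total = [t + weight * x for t, x in zip(total, word_vectors[term])]
        weight_sum += weight
    if weight_sum == 0.0:
        return total
    return [t / weight_sum for t in total]


def rank(topic_text, docs, word_vectors, dim):
    """Return (document indices by descending similarity, raw scores)."""
    tokenized = [tokenize(d) for d in docs]
    idf = idf_table(tokenized)
    doc_vecs = [weighted_embedding(t, word_vectors, idf, dim) for t in tokenized]
    topic_vec = weighted_embedding(tokenize(topic_text), word_vectors, idf, dim)
    scores = [cosine(topic_vec, v) for v in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy corpus and topic, invented for illustration only.
DOCS = ["coronavirus origin bat", "vaccine trial", "weather humidity"]
order, scores = rank("coronavirus origin", DOCS, TOY_VECTORS, dim=3)
print(order, [round(s, 3) for s in scores])
```

In this sketch the topic text plays the role of a query, question, or narrative; concatenating those fields before calling `rank` would correspond to the "combinations of them" mentioned in the abstract.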

Data Availability Statement

The data used in this work are publicly available on the Kaggle website: https://www.kaggle.com/competitions/trec-covid-information-retrieval.

Funding

No external funding was received for this work.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, SRM University, Guntur, Andhra Pradesh, 522502, India
Satya Uday Sanku, Satti Thanuja Pavani & T. Jaya Lakshmi
School of Computing, University of Huddersfield, Huddersfield, HD1 3DH, UK
Rohit Chivukula
College of Engineering, University of South Florida, Tampa, Florida, USA
Satya Uday Sanku
Department of Computer Science, San Diego State University, San Diego, California, USA
Satti Thanuja Pavani

Contributions

Satya Uday Sanku and Satti Thanuja Pavani contributed equally to the main research idea and its implementation; Rohit Chivukula prepared the documentation; T. Jaya Lakshmi interpreted and analysed the results.

Corresponding author

Correspondence to T. Jaya Lakshmi.

Ethics declarations

Conflict of Interest

The authors do not have any competing interests.

Research Involving Human and/or Animals

Not applicable.

Informed Consent

All authors provided consent for this publication.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Sanku, S.U., Pavani, S.T., Lakshmi, T.J. et al. COVID-19 Literature Mining and Retrieval Using Text Mining Approaches. SN COMPUT. SCI. 5, 211 (2024). https://doi.org/10.1007/s42979-023-02550-1

Received: 09 April 2023
Accepted: 10 December 2023
Published: 17 January 2024
DOI: https://doi.org/10.1007/s42979-023-02550-1
