Abstract
The PubMed literature database is a valuable source of information for scientific research, holding more than 24 million citations to the biomedical literature. Mining literature of this volume is a challenging task. Although several text-mining algorithms focused on data visualization have been developed in recent years, they are limited in speed, rigid in use, and not available as open source. We have developed an R package, pubmed.mineR, in which we have combined the advantages of existing algorithms, overcome their limitations, offered user flexibility, and provided links to other packages in Bioconductor and the Comprehensive R Archive Network (CRAN) so that users can carry out multifaceted analyses. Three case studies, namely ‘Evolving role of diabetes educators’, ‘Cancer risk assessment’ and ‘Dynamic concepts on disease and comorbidity’, illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even on large corpora and with compute-intensive functions. pubmed.mineR is available at http://cran.r-project.org/web/packages/pubmed.mineR .
Acknowledgements
The authors thank Smriti Sharma and Inna Mittal for using the package and providing valuable feedback on text-mining and for suggesting improvements to the algorithms.
This work was supported by the Council of Scientific and Industrial Research, grant BSC0122.
Additional information
Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/oct2015/supp/Rani.pdf
[Rani J, Shah AR and Ramachandran S 2015 pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts. J. Biosci.] DOI 10.1007/s12038-015-9552-2
Glossary
- Association: A term used to denote ‘closeness’ in the relationship between a pair of terms.
- Concept: A word or phrase referring to how something works. Examples: diabetes education, self-management, depigmentation, autoimmune.
- Corpus: A collection of documents. Plural: corpora.
- Document summarization: A short summary of a document that includes its most important parts, such as a brief introduction and the conclusion.
- Pre-processing: The process of preparing text for analysis using mathematical approaches or other search and display utilities. Examples: word tokenization, sentence tokenization.
- Term: A word with an exact meaning. Examples: patient, vitiligo, diabetes educator.
- Term-document matrix: A numerical matrix in which terms are in rows, documents are in columns, and each cell contains the frequency of occurrence of a term in a document.
- Text classification: Classifying documents under defined terms.
- Themes: Subjects usually defined by terms and preferably non-overlapping.
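To make the glossary definitions of pre-processing and the term-document matrix concrete, here is a minimal sketch in plain Python. It is not the pubmed.mineR implementation, which is written in R; the function names `tokenize` and `term_document_matrix`, the regular-expression tokenizer and the three-sentence toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Word tokenization: lowercase the text and keep runs of letters.

    A deliberately simple stand-in for real pre-processing (assumption).
    """
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(documents):
    """Build a term-document matrix.

    Returns (terms, matrix), where matrix[i][j] is the number of times
    terms[i] occurs in documents[j]: terms in rows, documents in columns.
    """
    counts = [Counter(tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


# Hypothetical toy corpus standing in for a set of abstracts.
corpus = [
    "Diabetes educators support patient self-management.",
    "Vitiligo is an autoimmune depigmentation disorder.",
    "Patient education improves diabetes self-management.",
]

terms, tdm = term_document_matrix(corpus)

# Print one row per term, one column per document.
for term, row in zip(terms, tdm):
    print(f"{term:15s} {row}")
```

Each row is the frequency profile of one term across the corpus. Note that this simple tokenizer splits hyphenated words such as "self-management" into two terms, a choice a real pipeline might make differently.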
Cite this article
Rani, J., Shah, A.R. & Ramachandran, S. pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts. J Biosci 40, 671–682 (2015). https://doi.org/10.1007/s12038-015-9552-2
DOI: https://doi.org/10.1007/s12038-015-9552-2