pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts

Journal of Biosciences

Abstract

The PubMed literature database is a valuable source of information for scientific research. It is rich in biomedical literature, with more than 24 million citations. Data-mining such a voluminous literature is a challenging task. Although several text-mining algorithms with a focus on data visualization have been developed in recent years, they have limitations: they are slow, rigid, and not available as open source. We have developed an R package, pubmed.mineR, in which we combine the advantages of existing algorithms, overcome their limitations, and offer user flexibility and links to other packages in Bioconductor and the Comprehensive R Archive Network (CRAN), so that users can carry out multifaceted analyses. Three case studies, namely ‘Evolving role of diabetes educators’, ‘Cancer risk assessment’ and ‘Dynamic concepts on disease and comorbidity’, illustrate the use of pubmed.mineR. The package generally runs fast, with small elapsed times on regular workstations, even on large corpora and with compute-intensive functions. pubmed.mineR is available at http://cran.r-project.org/web/packages/pubmed.mineR.

References

  • Bodenhofer U, Kothmeier A and Hochreiter S 2011 APCluster: an R package for affinity propagation clustering. Bioinformatics 27 2463–2464

  • Canese K and Weis S 2013 updated PubMed: The Bibliographic Database; in The NCBI Handbook [Internet] 2nd edition

  • Cheng D, Knox C, Young N, Stothard P, Damaraju S and Wishart DS 2008 PolySearch: a web-based text mining system for extracting relationships between human diseases, genes, mutations, drugs and metabolites. Nucleic Acids Res. 36 399–405

  • Cohen KB and Hunter LE 2013 Chapter 16: Text mining for translational bioinformatics. PLoS Comput. Biol. 9 e1003044

  • Davi A, Haughton D, Nasr N, Shah G, Skaletsky M and Spack R 2005 A Review of Two Text-Mining Packages: SAS TextMining and WordStat. Am. Stat. 59 89–103

  • Delfs R, Doms A, Kozlenkov A and Schroeder M 2004 GoPubMed: ontology-based literature search applied to GeneOntology and PubMed; in Proceedings of German Bioinformatics Conference pp 169–178

  • Drab S 2013 The Evolving Role of Diabetes Educators. Am. J. Med. Sci. 345 307–313

  • Feinerer I, Hornik K and Meyer D 2008 Text mining infrastructure in R. J. Stat. Softw. 25 1–54

  • Frey BJ and Dueck D 2007 Clustering by passing messages between data points. Science 315 972–976

  • Frisch M, Klocke B, Haltmeier M and Frech K 2009 LitInspector: literature and signal transduction pathway mining in PubMed abstracts. Nucleic Acids Res. 37 135–140

  • Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, et al. 2004 Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5 R80

  • Giron J, Ginebra J and Riba A 2005 Bayesian analysis of a multinomial sequence and homogeneity of literary style. Am. Stat. 59 19–30

  • Gray KA, Yates B, Seal RL, Wright MW and Bruford EA 2015 Genenames.org: the HGNC resources in 2015. Nucleic Acids Res. doi:10.1093/nar/gku1071

  • Korhonen A, Silins I, Sun L and Stenius U 2009 The first step in the development of text mining technology for cancer risk assessment: identifying and organizing scientific evidence in risk assessment literature. BMC Bioinf. 10 303

  • Maglott D, Ostell J, Pruitt KD and Tatusova T 2011 Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 39 D52–D57

  • Radlinski F and Joachims T 2007 Active exploration for learning rankings from click-through data; in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining pp 570–579

  • Saito R, Smoot ME, Ono K, Ruscheinski J, Wang PL, Lotia S, Pico AR, Bader GD, et al. 2012 A travel guide to Cytoscape plugins. Nat. Methods 9 1069–1076

  • The UniProt Consortium 2014 Activities at the Universal Protein Resource (UniProt). Nucleic Acids Res. 42 D191–D198

  • Wild F 2007 lsa: Latent Semantic Analysis; R package version 0.63-3, http://CRAN.R-project.org/package=lsa

Acknowledgements

The authors thank Smriti Sharma and Inna Mittal for using the package and providing valuable feedback on text-mining and for suggesting improvements to the algorithms.

This work was supported by the Council of Scientific and Industrial Research (grant BSC0122).

Author information

Corresponding author

Correspondence to Srinivasan Ramachandran.

Additional information

Supplementary materials pertaining to this article are available on the Journal of Biosciences Website at http://www.ias.ac.in/jbiosci/oct2015/supp/Rani.pdf

[Rani J, Shah AR and Ramachandran S 2015 pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts. J. Biosci.] DOI 10.1007/s12038-015-9552-2

Glossary

Association

A term used to denote ‘closeness’ in relationship between a pair of terms.

Concept

A word or phrase describing how something works. Examples: diabetes education, self-management, depigmentation, autoimmune.

Corpus

A collection of documents (plural: corpora).

Document summarization

A short summary of a document that includes its most important parts, such as a brief introduction and conclusion.

Pre-processing

The process of preparing text for analysis using mathematical approaches or other search and display utilities. Examples: word tokenization, sentence tokenization.

Term

A word with an exact meaning. Examples: patient, vitiligo, diabetes educator.

Term-document matrix

A numerical matrix where terms are in rows and documents are in columns and the cells contain frequencies of occurrence of terms in the documents.

Text classification

Classifying the documents under defined terms.

Themes

Subjects usually defined by terms and preferably non-overlapping.
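
To make the glossary entries for pre-processing and the term-document matrix concrete, the following is a minimal Python sketch. It is not the pubmed.mineR implementation, which is written in R. The function names, the naive tokenization rules and the two-sentence toy corpus are all assumptions made up for illustration.

```python
import re
from collections import Counter


def sentence_tokenize(text):
    """Split text into sentences after '.', '!' or '?' followed by whitespace.

    This is a naive rule; it will mis-split abbreviations such as 'et al.'.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(text):
    """Lower-case the text and return runs of letters and digits as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_document_matrix(documents):
    """Build a term-document matrix from a list of document strings.

    Returns (terms, matrix), where terms is sorted alphabetically and
    matrix[i][j] is the frequency of terms[i] in documents[j], so terms
    are in rows and documents are in columns.
    """
    counts = [Counter(word_tokenize(doc)) for doc in documents]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


# Hypothetical toy corpus, invented for illustration only.
corpus = [
    "Diabetes educators support patients. Educators teach self-management.",
    "Vitiligo is a depigmentation disorder. Patients may have autoimmune disease.",
]

terms, tdm = term_document_matrix(corpus)

# Print each term with its per-document frequencies.
for term, row in zip(terms, tdm):
    print(term, row)
```

Each printed row corresponds to one term and each column to one document of the corpus, matching the orientation given in the term-document matrix entry. A real analysis would normally also remove stop words and handle multi-word terms such as "diabetes educator", which this sketch does not attempt.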

About this article

Cite this article

Rani, J., Shah, A.R. & Ramachandran, S. pubmed.mineR: An R package with text-mining algorithms to analyse PubMed abstracts. J Biosci 40, 671–682 (2015). https://doi.org/10.1007/s12038-015-9552-2

  • DOI: https://doi.org/10.1007/s12038-015-9552-2
