Impact of Context on Keyword Identification and Use in Biomedical Literature Mining

  • Venu G. Dasigi (email author)
  • Orlando Karam
  • Sailaja Pydimarri
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 880)


The use of two statistical metrics in automatically identifying important keywords associated with a concept such as a gene by mining scientific literature is reviewed. Starting with a subset of MEDLINE® abstracts that contain the name or synonyms of a gene in their titles, the aforementioned metrics contrast the prevalence of specific words in these documents against a broader “background set” of abstracts. If a word occurs substantially more often in the document subset associated with a gene than in the background set that acts as a reference, then the word is viewed as capturing some specific attribute of the gene.
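The contrast between a gene's document subset and the background set can be sketched in a few lines of Python. The abstract does not give the formulas, so this is only an illustration under stated assumptions: it uses a binomial-style Z-score (observed count minus expected count, divided by the binomial standard deviation), and the function name `z_score` and the toy token lists are hypothetical.

```python
import math

def z_score(word, gene_tokens, background_tokens):
    """Binomial-style Z-score of `word` in the gene subset versus the background.

    Assumption (not from the paper): the background relative frequency p is
    treated as the chance that any one token is `word`, so the expected count
    in the gene subset is n * p with standard deviation sqrt(n * p * (1 - p)).
    """
    n = len(gene_tokens)
    observed = gene_tokens.count(word)
    p = background_tokens.count(word) / len(background_tokens)
    if p <= 0.0 or p >= 1.0:
        # The score is undefined here; this sketch simply returns 0.0.
        return 0.0
    expected = n * p
    std = math.sqrt(n * p * (1.0 - p))
    return (observed - expected) / std

# Hypothetical toy data: tokens from abstracts about one gene, and tokens
# from a broader background set of abstracts.
gene_tokens = "kinase binds receptor kinase signal protein".split()
background_tokens = (
    "protein binds dna kinase protein cell "
    "cell protein signal protein cell dna"
).split()

for w in ("kinase", "protein"):
    print(w, round(z_score(w, gene_tokens, background_tokens), 3))
```

In this toy data "kinase" is over-represented in the gene subset and receives a positive score, while "protein", which is common in the background, receives a negative one.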

The keywords thus automatically identified may be used as gene features in clustering algorithms. Since the background set is the reference against which keyword prevalence is contrasted, the authors hypothesize that different background document sets can cause somewhat different sets of keywords to be identified as specific to a gene. Two different background sets are discussed that serve two somewhat different purposes, namely, characterizing the function of a gene, and clustering a set of genes based on their shared functional similarities. Experimental results that reveal the significance of the choice of background set are presented.
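The dependence of the selected keywords on the background set can be illustrated with a small TF-IDF sketch. The specific weighting used here (raw term count times a smoothed logarithmic inverse document frequency computed over the background set) is an assumption, as are the function names and the two toy background sets; neither background is meant to reproduce the ones discussed in the paper.

```python
import math
from collections import Counter

def tf_idf_scores(gene_docs, background_docs):
    """Score each word in the gene documents against a background set.

    Assumption: tf is the raw count over all gene documents, and idf is
    log((1 + N) / (1 + df)), with N and df taken from the background set.
    """
    tf = Counter(tok for doc in gene_docs for tok in doc)
    n_bg = len(background_docs)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in background_docs if word in doc)
        scores[word] = count * math.log((1 + n_bg) / (1 + df))
    return scores

def top_keywords(gene_docs, background_docs, k=1):
    """Return the k highest-scoring words (ties broken alphabetically)."""
    scores = tf_idf_scores(gene_docs, background_docs)
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]

# Hypothetical documents about one gene, already tokenized.
gene_docs = [["kinase", "receptor", "signaling"],
             ["kinase", "receptor", "membrane"]]

# Hypothetical broad background: "receptor" is common, "kinase" is absent.
broad_bg = [["receptor", "cell"], ["receptor", "membrane"],
            ["cell", "protein"], ["receptor", "protein"]]

# Hypothetical narrow background: every document mentions "kinase".
narrow_bg = [["kinase", "signaling"], ["kinase", "membrane"],
             ["kinase", "protein"], ["kinase", "signaling"]]

print(top_keywords(gene_docs, broad_bg))
print(top_keywords(gene_docs, narrow_bg))
```

With the broad background the top keyword is "kinase"; with the narrow background, where "kinase" no longer distinguishes the gene, the top keyword becomes "receptor".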


Keywords: Literature mining, Automatic keyword identification, TF-IDF, Z-score, Background set, Features, Clustering



The authors acknowledge that the MEDLINE® data used in this research are covered by a license agreement supported by the U.S. National Library of Medicine. Thanks are also due to Professor Rajnish Singh (Kennesaw State University) for her assistance in relation to evaluating the keywords for the various genes, and for her help in other ways related to this work.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Venu G. Dasigi (1), email author
  • Orlando Karam (2)
  • Sailaja Pydimarri (3)
  1. Bowling Green State University, Bowling Green, USA
  2. Kennesaw State University, Marietta, USA
  3. Life University, Marietta, USA
