Impact of Context on Keyword Identification and Use in Biomedical Literature Mining
Two statistical metrics are reviewed for automatically identifying important keywords associated with a concept, such as a gene, by mining scientific literature. Starting with a subset of MEDLINE® abstracts that contain the name or synonyms of a gene in their titles, these metrics contrast the prevalence of specific words in those documents against a broader “background set” of abstracts. If a word occurs substantially more often in the document subset associated with a gene than in the background set that acts as a reference, the word is viewed as capturing some specific attribute of the gene.
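The contrast between a gene's document subset and a background set can be illustrated with a small sketch. The abstract does not give the exact formulas, so the score below is an assumption: a binomial Z-score on document frequency, with a smoothing constant to avoid a zero background rate. The function names `doc_frequency` and `keyword_z_scores` and the toy abstracts are hypothetical and are not taken from the work itself.

```python
import math
from collections import Counter


def doc_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def keyword_z_scores(gene_docs, background_docs, smoothing=0.5):
    """Score each word in gene_docs against its rate in background_docs.

    Assumed metric (not the authors' exact formula): treat the number of
    gene documents containing a word as binomial with the background rate p,
    and report (observed - expected) / standard deviation.
    """
    gene_df = doc_frequency(gene_docs)
    bg_df = doc_frequency(background_docs)
    n = len(gene_docs)
    n_bg = len(background_docs)
    scores = {}
    for word, observed in gene_df.items():
        # Smoothed background rate, kept strictly between 0 and 1.
        p = (bg_df.get(word, 0) + smoothing) / (n_bg + 2 * smoothing)
        expected = n * p
        std = math.sqrt(n * p * (1 - p))
        scores[word] = (observed - expected) / std
    return scores


# Toy data standing in for MEDLINE abstracts (illustrative only).
gene_docs = [
    "kinase binds receptor",
    "kinase activity in the cell",
    "the kinase pathway",
]
background_docs = [
    "the cell divides",
    "the protein folds",
    "a receptor in the membrane",
    "the study of the cell",
]

scores = keyword_z_scores(gene_docs, background_docs)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked[0])
```

In this toy example a word that appears in every gene document but in no background document receives a high positive score, while a common word such as "the" scores below zero because it is at least as frequent in the background.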
The keywords thus identified may be used as gene features in clustering algorithms. Since the background set is the reference against which keyword prevalence is contrasted, the authors hypothesize that different background document sets can cause somewhat different sets of keywords to be identified as specific to a gene. Two background sets are discussed, each suited to a somewhat different purpose: characterizing the function of a gene, and clustering a set of genes based on their shared functional similarities. Experimental results that reveal the significance of the choice of background set are presented.
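The effect of the background set on the selected keywords, and the use of those keywords as clustering features, can be sketched as follows. This is a minimal illustration under stated assumptions: the TF-IDF variant (raw term count times a smoothed logarithmic IDF computed from the background set), the cosine similarity measure, the helper names `tfidf_features` and `cosine`, and all sample texts are hypothetical choices made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def tfidf_features(gene_docs, background_docs, top_k=3):
    """Return the top_k words for a gene as a {word: weight} feature vector.

    Assumed weighting: term count in the gene's documents multiplied by
    log((N + 1) / (df + 1)), where df is the background document frequency.
    """
    tf = Counter(w for doc in gene_docs for w in doc.lower().split())
    bg_sets = [set(doc.lower().split()) for doc in background_docs]
    n_bg = len(bg_sets)
    weights = {}
    for word, count in tf.items():
        df = sum(1 for words in bg_sets if word in words)
        weights[word] = count * math.log((n_bg + 1) / (df + 1))
    # Sort by descending weight, breaking ties alphabetically.
    top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return dict(top)


def cosine(a, b):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy documents (illustrative only).
gene_a_docs = ["kinase phosphorylates tau protein", "tau kinase in neurons"]
gene_b_docs = ["tau aggregation in neurons"]

# A broad background versus one drawn from a kinase-heavy collection.
broad_bg = [
    "protein folding study",
    "cell membrane protein",
    "gene expression in liver",
    "immune response cell",
]
focused_bg = [
    "kinase signaling protein",
    "kinase inhibitor study",
    "receptor kinase in cell",
    "protein kinase activity",
]

broad_keywords = tfidf_features(gene_a_docs, broad_bg)
focused_keywords = tfidf_features(gene_a_docs, focused_bg)
print(sorted(broad_keywords))
print(sorted(focused_keywords))

# Keyword vectors double as features for comparing genes.
similarity = cosine(broad_keywords, tfidf_features(gene_b_docs, broad_bg))
print(round(similarity, 3))
```

Against the broad background, "kinase" is selected as a keyword for the first gene; against the kinase-heavy background it carries no weight and drops out, while "tau" survives in both cases. A pairwise similarity such as the cosine value computed here is the kind of input a clustering algorithm would consume.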
Keywords: Literature mining, Automatic keyword identification, TF-IDF, Z-score, Background set, Features, Clustering
The authors acknowledge that the MEDLINE® data used in this research are covered by a license agreement supported by the U.S. National Library of Medicine. Thanks are also due to Professor Rajnish Singh (Kennesaw State University) for her assistance in relation to evaluating the keywords for the various genes, and for her help in other ways related to this work.