Document clustering of MEDLINE abstracts based on non-negative matrix factorization using local confidence assessment


A document search in PubMed is one of the most exhaustive ways to find information on any biological or biomedical topic. However, a keyword search that is not specific enough returns far more documents than a user can read one by one. In this work, we therefore present MedClus, a new document clustering tool for bioinformaticians that makes a PubMed keyword search result more concise by grouping the retrieved documents into clusters. MedClus contains two modules. The first is a pre-clustering module that creates the data matrix. This matrix contains term-document frequencies computed with the TF*IDF method, together with optional weights. The weights are assigned by comparing the term list with the MeSH terms contained in the related MEDLINE abstracts. The second is a clustering module based on a non-negative matrix factorization (NMF) algorithm, which finds an approximate factorization of the data matrix. The application was tested in several experiments that evaluated its performance and reliability. Based on these results, a list of recommended ranges for crucial parameters, such as the number of clusters, was compiled to guide users in applying MedClus. Finally, some results were analyzed by scientists from the fields of medicine and biology, who evaluated the relevance of the terms and whether a relation exists between them. MedClus is a tool that can restructure the result list of a keyword search for documents in PubMed. It does so by extracting terms before clustering and by finding latent semantics during the clustering process. It also optionally applies weights to terms that appear as MeSH terms in at least one of the MEDLINE abstracts. It therefore helps users refine a PubMed search result through term-based clustering, saving time and effort.
At this development stage, the software is suitable for experienced users such as bioinformaticians, database administrators and developers. A web service for the Semantic Toxicogenomics Knowledgebase has also applied this technology to provide comprehensive and accurate relations between chemical and toxicological contexts.
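
The two modules described in the abstract can be illustrated with a minimal sketch. It is not the MedClus implementation, only one plausible reading of the pipeline under stated assumptions: documents arrive already tokenized, the IDF variant `log(N/df) + 1` is chosen for illustration, MeSH weighting is modeled as a simple row multiplier (`mesh_weight`, a hypothetical parameter), and the factorization uses the multiplicative update rules of Lee and Seung for the Frobenius-norm objective. The function names `tfidf_matrix` and `nmf` and the toy documents are invented for this example.

```python
import numpy as np


def tfidf_matrix(docs, mesh_terms=None, mesh_weight=2.0):
    """Build a term-by-document TF*IDF matrix from tokenized documents.

    Assumption: terms that also occur in `mesh_terms` have their rows
    multiplied by `mesh_weight` (a hypothetical weighting scheme).
    """
    vocab = sorted({term for doc in docs for term in doc})
    index = {term: i for i, term in enumerate(vocab)}

    # Raw term frequencies: rows are terms, columns are documents.
    tf = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            tf[index[term], j] += 1.0

    # Document frequency and an IDF variant (assumed: log(N/df) + 1).
    df = np.count_nonzero(tf, axis=1)
    idf = np.log(len(docs) / df) + 1.0
    V = tf * idf[:, None]

    # Optional boost for terms that are also MeSH terms.
    for term in mesh_terms or ():
        if term in index:
            V[index[term], :] *= mesh_weight
    return vocab, V


def nmf(V, k, n_iter=200, seed=0, eps=1e-9):
    """Approximate V by W @ H with W, H >= 0 using multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_terms, n_docs = V.shape
    W = rng.random((n_terms, k)) + eps
    H = rng.random((k, n_docs)) + eps
    for _ in range(n_iter):
        # Each update keeps the factors non-negative; eps avoids division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Toy "abstracts", already tokenized (illustrative data only).
docs = [
    "liver toxicity induced by acetaminophen".split(),
    "acetaminophen overdose and liver injury".split(),
    "gene expression clustering with microarray data".split(),
    "microarray gene expression profiles".split(),
]
vocab, V = tfidf_matrix(docs, mesh_terms={"liver", "microarray"})
W, H = nmf(V, k=2)

# Each document is assigned to the cluster with the largest coefficient in H.
labels = H.argmax(axis=0)

# The highest-weighted rows of each column of W describe that cluster.
for c in range(W.shape[1]):
    top_terms = [vocab[i] for i in np.argsort(W[:, c])[::-1][:3]]
    print(f"cluster {c}: {top_terms}")
print("document labels:", labels.tolist())
```

In this reading, the columns of `W` act as term profiles for the clusters and the columns of `H` give each document's affinity to them. The number of clusters `k` is the parameter for which the abstract says recommended ranges were compiled; the value 2 here is chosen only to fit the toy data.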


Author information

Corresponding authors

Correspondence to Chulhwan Park or Man-gi Cho.

About this article

Cite this article

Kang, BC., Sur, ZW., Park, C. et al. Document clustering of MEDLINE abstracts based on non-negative matrix factorization using local confidence assessment. BioChip J 4, 336–349 (2010).

Keywords

  • Text-mining
  • Literature clustering
  • Non-negative matrix factorization
  • Local confidence assessment
  • Bioinformatics