
Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications

Cluster Computing

Abstract

Clustering of biomedical documents has become an important research topic because of its relevance to clinical and telemedicine applications. Clustering medical documents is difficult because the documents are unstructured. This paper develops an efficient document clustering approach for medical documents to be used in telemedicine applications. Most existing models use n-gram techniques for phrase identification and term-, concept-, or semantic-based models for clustering. However, n-gram techniques perform poorly when the original document has been modified, and only hybrid models provide relatively improved clustering. The proposed approach, named the enriched semantic smoothing model, is built on the MeSH ontology. Because the semantic smoothing model does not handle the density of general words effectively, an improved model with a term frequency and inverse gravity moment (TF-IGM) factor and improved background elimination is used. Unlike term frequency and inverse document frequency (TF-IDF), TF-IGM precisely measures the class-distinguishing power of a term by making use of the fine-grained term distribution across different classes of text in documents. A modified n-gram technique, which detects and averts cases of substitution and deletion in the documents, improves phrase identification. The proposed model improves the clustering efficiency of the k-means and hierarchical clustering algorithms. Experiments are conducted on MeSH ontology based PubMed documents, with similarity measures and cluster validity indexes used for comparison. The results show that the proposed approach to medical document clustering is highly accurate and can thus support clinical practice and telemedicine.
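The TF-IGM weighting mentioned in the abstract can be illustrated with a small sketch. The abstract does not state the formula, so the code below assumes the formulation commonly attributed to Chen et al. (2016): a term's class-specific frequencies are sorted in descending order, the inverse gravity moment is the largest frequency divided by the rank-weighted sum of all frequencies, and the final weight is tf * (1 + lambda * igm). The function names, the example frequencies, and the default lambda = 7 are illustrative assumptions, not details taken from this paper.

```python
def igm(class_freqs):
    """Inverse gravity moment of a term (assumed formulation).

    class_freqs: the term's frequency in each class (hypothetical input).
    Returns a value in (0, 1]; 1.0 means the term occurs in one class only.
    """
    ranked = sorted(class_freqs, reverse=True)
    if not ranked or ranked[0] == 0:
        return 0.0  # term never occurs, so it has no distinguishing power
    # Gravity moment: each frequency weighted by its rank (1 = most frequent class)
    moment = sum(freq * rank for rank, freq in enumerate(ranked, start=1))
    return ranked[0] / moment


def tf_igm(tf, class_freqs, lam=7.0):
    """TF-IGM weight of a term in one document (assumed formulation).

    tf:  raw term frequency in the document.
    lam: adjustable coefficient; 7.0 is an assumed default.
    """
    return tf * (1.0 + lam * igm(class_freqs))


# A term concentrated in one class vs. a term spread evenly over three classes
concentrated = tf_igm(2, [10, 0, 0])
uniform = tf_igm(2, [5, 5, 5])
```

With the same term frequency, the concentrated term receives the larger weight, which matches the abstract's claim that TF-IGM reflects how a term is distributed across classes rather than only how many documents contain it. How the paper obtains the class-level frequencies in a clustering setting is not described in the abstract.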


Author information

Corresponding author

Correspondence to R. Sandhiya.

About this article

Cite this article

Sandhiya, R., Sundarambal, M. Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Comput 22 (Suppl 2), 3213–3230 (2019). https://doi.org/10.1007/s10586-018-2023-4

  • DOI: https://doi.org/10.1007/s10586-018-2023-4
