Abstract
Clustering of biomedical documents has become a vital research concept due to its importance in the clinical and telemedicine applications. The clustering of the medical documents is being considered as a major issue because of its unstructured nature. This paper focuses on developing an efficient document clustering approach for the medical documents to be utilized in telemedicine applications. Most existing models utilize n-gram techniques for phrase identification and term, concept or semantic based models for clustering applications. However n-gram does not perform well when the original document has been modified while only hybrid models provide relatively improved clustering. The proposed document clustering approach is named as enriched semantic smoothing model which has been developed on the concept of Mesh ontology. As the semantic smoothing model is not effective in handling the density of general words, an improved model with term frequency and inverse gravity moment (TF-IGM) factor and improved background elimination is used. Unlike term frequency and inverse document frequency), TF-IGM precisely measure the class distinguishing power of a term by making use of the fine-grained term distribution across different classes of text in documents. The modified n-gram technique, which detects the cases of substitution and deletion in the documents and averts them, improves the phrases identification. The clustering efficiency of the k-means clustering and hierarchical clustering algorithms is improved by utilizing the proposed model. The experiments are made on Mesh ontology based PubMed documents with similarity measures and cluster validity indexes used for comparisons. The results show that the proposed approach of medical document clustering is highly accurate and thus improves the concepts of clinical practices and telemedicine.
Similar content being viewed by others
References
Leuski, A.: Evaluating document clustering for interactive information retrieval. In: Proceedings of the Tenth International Conference on Information and Knowledge Management, pp. 33–40. ACM (2001)
Steinbach, M., Karypis, G., Kumar, V.: A comparison of document clustering techniques. In: KDD Workshop on Text Mining vol. 400(1), pp. 525–526 (2000)
Ding, C.H., He, X., Zha, H., Gu, M., Simon, H.D.: A min-max cut algorithm for graph partitioning and data clustering. In: Proceedings IEEE International Conference on Data Mining, 2001. ICDM 2001, pp. 107–114. IEEE (2001)
Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524. ACM (2002)
Chim, H., Deng, X.: Efficient phrase-based document similarity for clustering. IEEE Trans. Knowl. Data Eng. 20(9), 1217–1229 (2008)
Saad, F.H., de la Iglesia, B., Bell, D.G.: A comparison of two document clustering approaches for clustering medical documents. In: DMIN, pp. 425–431 (2006)
Wan, X., Yang, J.: Multi-document summarization using cluster-based link analysis. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 299–306. ACM (2008)
Liu, X., Croft, W.B.: Cluster-based retrieval using language models. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 186–193. ACM (2004)
Silva, J., Mexia, J., Coelho, A., Lopes, G.: Document clustering and cluster topic extraction in multilingual corpora. In: Proceedings IEEE International Conference on Data Mining, 2001. ICDM 2001, pp. 513–520. IEEE (2001)
Cios, K.J., Moore, G.W.: Uniqueness of medical data mining. Artif. Intell. Med. 26(1), 1–24 (2002)
Prather, J.C., Lobach, D.F., Goodwin, L.K., Hales, J.W., Hage, M.L., Hammond, W.E.: Medical data mining: knowledge discovery in a clinical data warehouse. In: Proceedings of the AMIA Annual Fall Symposium, p. 101. American Medical Informatics Association (1997)
Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. (CSUR) 31(3), 264–323 (1999)
Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Third IEEE International Conference on Data Mining, 2003. ICDM 2003, pp. 541–544. IEEE (2003)
Jing, L., Zhou, L., Ng, M.K., Huang, J.Z.: Ontology-based distance measure for text clustering. In Proceedings of SIAM SDM Workshop on Text Mining, Bethesda, MD (2006)
Yoo, I., Hu, X., Song, I.Y.: Integration of semantic-based bipartite graph representation and mutual refinement strategy for biomedical literature clustering. In: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 791–796. ACM (2006)
Logeswari, S., Premalatha, K.: Ontology-based semantic smoothing model for biomedical document clustering. Int. J. Telemed. Clin. Pract. 1(1), 94–110 (2015)
Ding, C., He, X.: K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning, p. 29. ACM (2004)
Pan, J.Y., Zhang, J.S.: Relationship matrix nonnegative decomposition for clustering. Math. Probl. Eng. 2011, 842325 (2011)
Zhong, Y., Zhang, L.: A new fuzzy clustering algorithm based on clonal selection for land cover classification. Math. Probl. Eng. 2011(2), 253–266 (2011)
Lee, M., Wang, W., Yu, H.: Exploring supervised and unsupervised methods to detect topics in biomedical text. BMC Bioinform. 7(1), 140 (2006)
Lin, J., Wilbur, W.J.: PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinform. 8(1), 423 (2007)
Theodosiou, T., Darzentas, N., Angelis, L., Ouzounis, C.A.: PuReD-MCL: a graph-based PubMed document clustering methodology. Bioinformatics 24(17), 1935–1941 (2008)
Nelson, S.J., Schopen, M., Savage, A.G., Schulman, J.L., Arluk, N.: The MeSH translation maintenance system: structure, interface design, and implementation. Stud. Health Technol. Inf. 11(Pt 1), 67–69 (2004)
Yoo, I., Hu, X., Song, I.Y.: Biomedical ontology improves biomedical literature clustering performance: a comparison study. Int. J. Bioinform. Res. Appl. 3(3), 414–428 (2007)
Zhang, X., Jing, L., Hu, X., Ng, M., Zhou, X.: A comparative study of ontology based term similarity measures on PubMed document clustering. In: Concepts, Systems and Applications, Advances in Databases, pp. 115–126 (2007)
Zhu, S., Zeng, J., Mamitsuka, H.: Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics 25(15), 1944–1951 (2009)
Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological networks and gene expression data. Bioinformatics 18(suppl 1), S145–S154 (2002)
Pan, W.: Incorporating gene functions as priors in model-based clustering of microarray gene expression data. Bioinformatics 22(7), 795–801 (2006)
Huang, D., Pan, W.: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 22(10), 1259–1268 (2006)
Shiga, M., Takigawa, I., Mamitsuka, H.: Annotating gene function by combining expression data with a modular gene network. Bioinformatics 23(13), i468–i478 (2007)
Wagstaff, K., Cardie, C., Rogers, S., Schrödl, S.: Constrained k-means clustering with background knowledge. ICML 1, 577–584 (2001)
Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)
Ji, X., Xu, W.: Document clustering with prior knowledge. In: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 405–412. ACM (2006)
Gupta, S., MacLean, D.L., Heer, J., Manning, C.D.: Induced lexico-syntactic patterns improve information extraction from online medical forums. J. Am. Med. Inf. Assoc. 21(5), 902–909 (2014)
Xu, Y., Hong, K., Tsujii, J., Chang, E.I.C.: Feature engineering combined with machine learning and rule-based methods for structured information extraction from narrative clinical discharge summaries. J. Am. Med. Inf. Assoc. 19(5), 824–832 (2012)
Ghoulam, A., Barigou, F., Belalem, G., Meziane, F.: Using local grammar for entity extraction from clinical reports. IJIMAI 3(3), 16–24 (2015)
Deleger, L., Molnar, K., Savova, G., Xia, F., Lingren, T., Li, Q., Solti, I.: Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. J. Am. Med. Inf. Assoc. 20(1), 84–94 (2013)
Ling, Y., Pan, X., Li, G., Hu, X.: Clinical documents clustering based on medication/symptom names using multi-view nonnegative matrix factorization. IEEE Trans. Nanobiosci. 14(5), 500–504 (2015)
Hübner, A., Walther, M., Kuhn, H.: Approach to clustering clinical departments. In: Health Care Systems Engineering for Scientists and Practitioners, pp. 111–120. Springer (2016)
Jun, S., Park, S.S., Jang, D.S.: Document clustering method using dimension reduction and support vector clustering to overcome sparseness. Expert Systems with Applications 41(7), 3204–3212 (2014)
Karaa, W.B.A., Ashour, A.S., Sassi, D.B., Roy, P., Kausar, N., Dey, N.: Medline text mining: an enhancement genetic algorithm based approach for document clustering. In: Applications of Intelligent Optimization in Biology and Medicine, pp. 267–287. Springer (2016)
Al-Ariki, H.D.E., Swamy, M.S.: A survey and analysis of multipath routing protocols in wireless multimedia sensor networks. Wirel. Netw. 23(6), 1823–1835 (2017)
Celebi, M.E. (ed.).: Partitional Clustering Algorithms. Springer, Cham (2014)
Chen, K., Zhang, Z., Long, J., Zhang, H.: Turning from TF-IDF to TF-IGM for term weighting in text classification. Expert Syst. Appl. 66, 245–260 (2016)
Barrón-Cedeño, A., Rosso, P.: On Automatic Plagiarism Detection Based on n-Grams Comparison. Advances in Information Retrieval, pp. 696-700. Springer, Berlin (2009)
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I.: An extensive comparative study of cluster validity indices. Pattern Recognit. 46(1), 243–256 (2013)
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Sandhiya, R., Sundarambal, M. Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Comput 22 (Suppl 2), 3213–3230 (2019). https://doi.org/10.1007/s10586-018-2023-4
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10586-018-2023-4