Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications

Sandhiya, R.; Sundarambal, M.

doi:10.1007/s10586-018-2023-4


Published: 20 March 2018

Cluster Computing, Volume 22, pages 3213–3230 (2019)

R. Sandhiya and M. Sundarambal


Abstract

Clustering of biomedical documents has become a vital research topic because of its importance in clinical and telemedicine applications. Clustering medical documents is considered a major challenge because the documents are unstructured. This paper focuses on developing an efficient document clustering approach for medical documents to be used in telemedicine applications. Most existing models use n-gram techniques for phrase identification and term-, concept-, or semantic-based models for clustering. However, n-gram techniques do not perform well when the original document has been modified, and only hybrid models provide relatively improved clustering. The proposed document clustering approach, named the enriched semantic smoothing model, is built on the MeSH ontology. Because the semantic smoothing model is not effective in handling the density of general words, an improved model with a term frequency and inverse gravity moment (TF-IGM) factor and improved background elimination is used. Unlike term frequency and inverse document frequency (TF-IDF), TF-IGM precisely measures the class-distinguishing power of a term by making use of the fine-grained distribution of the term across different classes of text in the documents. The modified n-gram technique, which detects cases of substitution and deletion in the documents and averts them, improves phrase identification. The proposed model improves the clustering efficiency of the k-means and hierarchical clustering algorithms. The experiments are conducted on MeSH ontology based PubMed documents, with similarity measures and cluster validity indices used for comparison. The results show that the proposed approach to medical document clustering is highly accurate and can thus benefit clinical practice and telemedicine.
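The TF-IGM weighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract gives no formula, so this follows the commonly cited definition from Chen et al. (2016), "Turning from TF-IDF to TF-IGM for term weighting in text classification": a term's per-class frequencies are sorted in descending order, and the inverse gravity moment is the top frequency divided by the rank-weighted sum of all of them. The function names, the coefficient name `lam`, and its default of 7.0 are illustrative assumptions, not the authors' implementation.

```python
def igm(class_freqs):
    """Inverse gravity moment of a term, given its frequency in each class.

    Assumed definition (after Chen et al., 2016): sort the per-class
    frequencies in descending order as f_1 >= f_2 >= ... >= f_m and return
    f_1 / sum(f_r * r). The value is 1.0 when the term occurs in a single
    class and shrinks as the term spreads evenly across classes.
    """
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment else 0.0


def tf_igm(tf, class_freqs, lam=7.0):
    """TF-IGM weight: term frequency scaled by (1 + lam * igm).

    `lam` is an adjustable coefficient; 7.0 is an assumed default here.
    """
    return tf * (1.0 + lam * igm(class_freqs))


# A term concentrated in one class versus one spread evenly over three.
concentrated = tf_igm(2, [10, 0, 0])
spread = tf_igm(2, [5, 5, 5])
print(concentrated, spread)
```

Under this definition, the term concentrated in one class receives a much larger weight than the evenly spread term with the same term frequency, which is the class-distinguishing behaviour the abstract attributes to TF-IGM.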



Author information

Authors and Affiliations

Department of Information Technology, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu, 641014, India
R. Sandhiya
Department of Electrical and Electronics Engineering, Coimbatore Institute of Technology, Coimbatore, Tamil Nadu, 641014, India
M. Sundarambal


Corresponding author

Correspondence to R. Sandhiya.


About this article

Cite this article

Sandhiya, R., Sundarambal, M. Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Comput 22 (Suppl 2), 3213–3230 (2019). https://doi.org/10.1007/s10586-018-2023-4


Received: 26 December 2017
Revised: 31 January 2018
Accepted: 02 February 2018
Published: 20 March 2018
Issue Date: March 2019
DOI: https://doi.org/10.1007/s10586-018-2023-4
