1 Introduction

Machine learning (ML), a subfield of artificial intelligence, leverages computational methods to address challenges using historical data and information without requiring significant alterations to the fundamental process [1]. ML algorithms have diverse applications, such as automated text classification [2], project analytics [3], spam email filtering [4], marketing analytics [5], and disease prediction [6]. They fall primarily into two categories, supervised learning and unsupervised learning, with some researchers also acknowledging reinforcement learning algorithms that learn data patterns to respond to specific environments. Nevertheless, supervised learning and unsupervised learning are the most recognised types. The critical difference between these two categories lies in the existence of labels within the training data subset [1]. Supervised ML relies on labelled data: the dataset includes input features and corresponding output labels, allowing the algorithm to learn a mapping function and make predictions for test or unseen data [7]. In contrast, unsupervised ML deals with unlabelled data: the dataset consists only of input features without output labels. This method discovers patterns or clusters autonomously, without direct instructions [8].

The data science research community has recently shown an amplified interest in medical informatics, with disease prediction being a key area of focus [9]. Disease prediction plays a critical role in modern healthcare: it allows for early treatment and improves patient outcomes. ML is a robust tool for predicting disease risk within intricate health data, as ML methods can learn from past data to predict future disease risks. Many studies have compared the performance of supervised ML methods in the disease prediction domain [10,11,12,13,14].

Nonetheless, there are limited comparative studies on unsupervised ML in the disease prediction domain, as it has not gained as much popularity as supervised ML [9]. Data labels are not always available, particularly when patients have undiagnosed or rare diseases. Vats et al. [15] compared unsupervised ML techniques for liver disease prediction, employing DBSCAN (Density-based spatial clustering of applications with noise), k-means, and Affinity Propagation to compare prediction accuracy and computational complexity. Antony et al. [16] proposed a framework that compares different unsupervised ML methods for chronic kidney disease prediction. Alashwal et al. [17] investigated various unsupervised methods for Alzheimer’s prediction, aiming to identify suitable techniques for patient grouping and their potential impact on treatment. Our review uncovered a research gap: a lack of thorough comparative studies of unsupervised learning algorithms across various types of disease prediction. As such, this research aims to evaluate the performance of different unsupervised ML algorithms in predicting diseases. It covers a variety of conditions, including heart failure, diabetes, and breast cancer, and employs unsupervised ML techniques such as k-means, DBSCAN, and Agglomerative Clustering for disease prediction. The objective is to compare predictive performance using several performance measures, such as the Silhouette coefficient, Adjusted Mutual Information, Adjusted Rand Index, and V-measure. These measures are crucial in identifying the most effective approach for handling different datasets with numerous parameters. The key contributions of this research include:

  • Comprehensive analysis and comparison of various unsupervised ML algorithms for disease risk prediction, using diverse benchmark datasets and performance measures.

  • Identification of the top-performing unsupervised ML method for healthcare researchers and stakeholders, which will ultimately help in selecting suitable techniques for enhanced disease risk prediction.

2 Methods

ML algorithms are primarily categorised into supervised and unsupervised learning based on the presence or absence of labels within the given data. Supervised learning uses labelled data, while unsupervised learning uses unlabelled data to discover patterns or clusters. This study focuses on different unsupervised learning methods in the disease prediction domain: partitioning clustering, model-based clustering, hierarchical clustering, and density-based clustering.

2.1 Unsupervised machine learning algorithms

Unsupervised ML, also known as clustering, involves grouping data into clusters based on the similarity of objects within the same cluster while ensuring that they are dissimilar to objects in other clusters [8]. Clustering is a type of unsupervised classification since there are no predefined classes.

Figure 1 shows how an unsupervised ML technique identifies three groups in a two-dimensional dataset. The dataset consists of 100 randomly generated data points divided into three groups based on their similarity. Different colours represent the clusters, and circular bounds have been drawn around each cluster on the scatter plot to better visualise their boundaries.

Fig. 1 An example of unsupervised learning

2.1.1 Partitioning clustering

Partitioning clustering requires the analyst to specify the number of clusters to be generated. The k-means clustering algorithm is the most widely used partitioning method [18]. Figure 2 demonstrates the steps of the standard k-means clustering algorithm. The first step selects \(k\) points as the initial centroids. Each data point is then assigned to the nearest of the \(k\) centroids. The centroid of each cluster is then recomputed from its assigned points, and these steps are repeated until the centroids no longer change. This study uses two popular k-means variants: classic k-means and Mini Batch k-means [19].

Fig. 2 Demonstration of the standard k-means clustering algorithm
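
To make the procedure concrete, the following minimal sketch runs both k-means variants with scikit-learn; the synthetic data and parameter values are chosen purely for illustration and are not taken from the study.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three groups (illustrative only)
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Classic k-means: assign each point to the nearest centroid, recompute
# centroids from the assigned points, and repeat until the centroids stabilise.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mini Batch k-means: optimises the same objective using small random batches,
# trading some accuracy for speed on larger datasets.
mbk_labels = MiniBatchKMeans(n_clusters=3, random_state=42).fit_predict(X)
```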

2.1.2 Model-based clustering

Model-based clustering is another unsupervised ML method. It is a probabilistic approach to clustering that uses Gaussian Mixture Models (GMMs) to represent data as a mixture of Gaussian distributions [20]. A GMM attempts to fit the dataset with a combination of Gaussian distributions and evaluates the likelihood of each data point belonging to each cluster, as opposed to classic k-means clustering, which allocates each data point to a single cluster. This enables a more flexible and accurate representation of data distributions, particularly when dealing with overlapping or non-spherical clusters [20]. Figure 3 shows a GMM with four components fitted to the data, with the resulting clusters coloured. The Gaussian components are shown as ellipses, illustrating each distribution’s spread and orientation and the probabilistic character of the clustering process. We also use the Bayesian Gaussian Mixture (BGM) [21] for performance comparison.

Fig. 3 Demonstration of the Gaussian Mixture Model
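
As a minimal sketch of model-based clustering with scikit-learn (the synthetic data and component counts are illustrative, not those used in the study):

```python
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data with four groups (illustrative only)
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)

# GMM with four components: each point gets a probability of belonging to
# each Gaussian component (soft assignment), unlike k-means' hard assignment.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
hard_labels = gmm.predict(X)             # most probable component per point
soft_memberships = gmm.predict_proba(X)  # per-component membership probabilities

# Bayesian Gaussian Mixture: places priors on the mixture weights, which lets
# the model effectively switch off components it does not need.
bgm_labels = BayesianGaussianMixture(n_components=4, random_state=0).fit_predict(X)
```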

2.1.3 Hierarchical clustering

Hierarchical clustering generates a set of nested clusters arranged in a hierarchical tree structure. This can be represented by a dendrogram, a tree-like diagram that records the sequence of merges or splits [22]. Figure 4 shows an example of a dendrogram for hierarchical clustering. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each point as its own cluster. At each step, the nearest pair of clusters is merged, and this continues until a single cluster remains or a specified number of clusters is reached, depending on the parameters set at the outset [22]. Divisive clustering, on the other hand, begins with a single, all-encompassing cluster. At each step, a cluster is split, and this continues until every cluster contains only an individual point or a predetermined number of clusters is reached, depending on the initial setup [22]. This study uses both Agglomerative clustering and Divisive clustering for comparison.

Fig. 4 Dendrogram for Hierarchical Clustering
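
The sketch below shows the agglomerative variant with scikit-learn, plus a SciPy dendrogram of the merge sequence. Note that scikit-learn does not provide a divisive algorithm, so this sketch covers only the bottom-up case; the data and parameters are illustrative only.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only)
X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Agglomerative (bottom-up) clustering: repeatedly merge the closest pair of
# clusters until the requested number of clusters remains.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Dendrogram of the merge sequence (cf. Fig. 4), built with SciPy's Ward linkage.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```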

2.1.4 Density-based clustering

The density-based method relies on density as the local cluster criterion, grouping points that are density-connected. Density-based clustering can identify clusters of arbitrary shape and handles noise in the data effectively. It requires only a single scan of the data, examining the local region to validate the density, but it requires density parameters to be specified as a termination condition [22]. Density-based spatial clustering of applications with noise (DBSCAN) is a well-known example of a density-based method [23]. It labels high-density areas as clusters and points in low-density areas as outliers, which helps it discover clusters of varied shapes and deal with noise without requiring a preset number of clusters [23]. Figure 5 visualises the clusters produced by the DBSCAN method.

Fig. 5 DBSCAN Clustering
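
A minimal DBSCAN sketch with scikit-learn follows; the two half-moon shapes illustrate non-spherical clusters, and the eps/min_samples values are illustrative rather than the settings used in the study.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that DBSCAN handles well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood radius and min_samples the core-point threshold;
# together they define the density criterion.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points that do not belong to any dense region are labelled -1 (noise).
n_noise = (labels == -1).sum()
```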

2.2 Performance comparison measures

The performance of various unsupervised ML methods is assessed using different evaluation techniques, such as Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, Completeness, V-measure, and Silhouette Coefficient. These are applied to establish comparative performance metrics in this study.

2.2.1 Adjusted Rand Index

The Adjusted Rand Index (ARI) is a chance-corrected modification of the Rand Index. It calculates a similarity score between two clusterings by considering all sample pairs and counting those pairs that are assigned to the same or to different clusters in the predicted and actual clusterings [24]. The formula for ARI is

$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathrm{Expected\, RI}}{\mathrm{Max(RI)} - \mathrm{Expected\, RI}}$$

where \(RI\) is the Rand Index, \(Expected \,RI\) is the Expected Rand Index, and \(Max\,(RI)\) is the maximum possible Rand Index.

The ARI value lies between -1 and 1, where 1 means identical clusterings and -1 means completely dissimilar clusterings. An ARI of 0 indicates random labelling.
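
For instance, using scikit-learn's implementation (the toy labellings below are purely illustrative):

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1]                             # ground-truth classes
print(adjusted_rand_score(y_true, [1, 1, 0, 0]))  # 1.0: identical partition (label names do not matter)
print(adjusted_rand_score(y_true, [0, 1, 0, 1]))  # negative: a discordant partition
```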

2.2.2 Adjusted Mutual Information

Adjusted Mutual Information (AMI) modifies the Mutual Information (MI) score to account for chance [25]. It addresses the fact that MI tends to increase as the number of clusters grows, independent of the actual amount of shared information between the clusterings. The formula for AMI is

$${\mathrm {AMI}(\mathrm {X,Y})}=\frac{\mathrm{MI}-\mathrm{Expected\, MI}}{\mathrm{Max}(\mathrm H(\mathrm X),\mathrm H(\mathrm Y))-\mathrm{Expected\, MI}}$$

where \(MI\) is the mutual information, \(H(X)\) and \(H(Y)\) are the entropies of X and Y, and \(Expected \, MI\) is the expected mutual information. AMI ranges from 0 to 1, where a score of 1 indicates perfect agreement between the two clusterings. A score close to 0 suggests largely independent clusterings or a result no better than random chance.
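
A small illustrative example with scikit-learn (toy labellings only):

```python
from sklearn.metrics import adjusted_mutual_info_score

y_true = [0, 0, 1, 1]
print(adjusted_mutual_info_score(y_true, [1, 1, 0, 0]))  # 1.0: perfect agreement
print(adjusted_mutual_info_score(y_true, [0, 1, 2, 3]))  # 0.0: singleton clusters carry no
                                                         # information once corrected for chance
```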

2.2.3 Homogeneity

Homogeneity is a clustering measure that compares the outcomes to a ground truth. A cluster is homogeneous if it is made up entirely of data points from a single class [26]. The formula for Homogeneity is

$$\mathrm{Homogeneity} = 1-\frac{H({Y}_{true}|{Y}_{predict})}{H({Y}_{true})}$$

where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{true}|{Y}_{predict})\) is the conditional entropy of the ground truth given the cluster predictions, and \(H({Y}_{true})\) is the entropy of the ground truth. Homogeneity also varies between 0 and 1. A score of 1 means that each cluster contains only members of a single class, signifying perfect homogeneity. A score of 0 indicates that the clusters are randomly assigned, lacking any homogeneity.
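
The following toy example (illustrative labels only) shows that homogeneity rewards pure clusters even if a class is split across several of them:

```python
from sklearn.metrics import homogeneity_score

y_true = [0, 0, 1, 1]
# Each predicted cluster contains a single class, so homogeneity is perfect
# even though class 1 is split across clusters 1 and 2.
print(homogeneity_score(y_true, [0, 0, 1, 2]))  # 1.0
# A single cluster mixing both classes has no homogeneity.
print(homogeneity_score(y_true, [0, 0, 0, 0]))  # 0.0
```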

2.2.4 Completeness

Completeness is another clustering evaluation metric, determining whether all data points belonging to a given class are assigned to the same cluster. The clustering result is deemed complete when each class is contained within a single cluster [26]. The formula for this measure is

$$\mathrm{Completeness} = 1-\frac{H({Y}_{predict}|{Y}_{true})}{H({Y}_{predict})}$$

where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{predict}|{Y}_{true})\) is the conditional entropy of the cluster predictions given the ground truth, and \(H({Y}_{predict})\) is the entropy of the cluster predictions. Completeness ranges from 0 to 1. A score of 1 is achieved when all members of each class are assigned to the same cluster, indicating complete capture of all classes within the clusters. A score of 0 implies that the clustering assignments are completely scattered, failing to capture the class structure.
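
A complementary toy example (illustrative labels only): completeness rewards keeping each class together, even inside an impure cluster:

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 1, 1]
# All members of each class end up in one cluster, so completeness is perfect,
# even though that cluster mixes the two classes.
print(completeness_score(y_true, [0, 0, 0, 0]))  # 1.0
# Splitting class 1 across two clusters lowers completeness (about 0.67 here).
print(completeness_score(y_true, [0, 0, 1, 2]))
```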

2.2.5 V-measure

The V-measure is the harmonic mean between Homogeneity and Completeness [26], and the formula is

$$\text{V-measure} = \frac{2 \times \mathrm{Homogeneity} \times \mathrm{Completeness}}{\mathrm{Homogeneity} + \mathrm{Completeness}}$$

The V-measure score lies between 0 and 1. A score of 1 represents perfect clustering, that is, perfectly complete and homogeneous labelling, with all classes fully captured within clusters and each cluster containing only members of a single class. A score of 0 indicates that the clustering fails on both homogeneity and completeness grounds.
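
Continuing the toy example above, the V-measure can be checked against the harmonic mean of the two component scores:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]   # homogeneous (pure clusters) but not complete (class 1 is split)
h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
assert abs(v - 2 * h * c / (h + c)) < 1e-12   # V-measure is the harmonic mean
print(h, c, v)   # roughly 1.0, 0.67, 0.8
```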

2.2.6 Silhouette coefficient

The silhouette coefficient is used in cluster analysis to assess clustering quality. It computes the distance between each data point in one cluster and the points in neighbouring clusters, measuring how well each data point fits into its allocated cluster [27]. The formula is

$$\mathrm{Silhouette}= \frac{{b}_{i} - {a}_{i}}{\max({a}_{i}, {b}_{i})}$$

where \({a}_{i}\) is the average distance from point \(i\) to the other points in its own cluster, and \({b}_{i}\) is the average distance from point \(i\) to the points in the nearest neighbouring cluster. Silhouette Coefficient values range from -1 to 1. A score of 1 denotes that the clusters are well separated and clearly distinguished. A score of 0 indicates overlapping clusters, and a negative value suggests that data points may have been assigned to the wrong clusters. This metric gives a perspective on the separation and cohesion of the formed clusters.
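
Unlike the preceding measures, the Silhouette coefficient does not use ground-truth labels; a minimal sketch with synthetic data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs (illustrative only)
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The score is computed from the feature matrix and the predicted labels alone.
print(silhouette_score(X, labels))  # close to 1 for well-separated clusters
```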

3 Research dataset

Table 1 presents the datasets utilised in this study, outlining their respective attributes, including the number of features and the data size. These datasets were sourced from the UCI Machine Learning Repository [28] and Kaggle [29]. This research uses the original data without further preprocessing, apart from dropping any entries with missing values, to ensure an unbiased comparison.
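
As a minimal sketch of this preparation step (the file name and the label column are hypothetical placeholders, not the actual dataset files):

```python
import pandas as pd

# Hypothetical example: load one benchmark dataset and drop entries with missing values.
df = pd.read_csv("heart_failure.csv")    # placeholder file name
df = df.dropna()                         # remove rows with any missing value
X = df.drop(columns=["target"]).values   # input features ("target" is a placeholder label column)
y = df["target"].values                  # ground-truth labels, used only for evaluation
```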

Table 1 A summary of the datasets

4 Results

We employed the default parameters provided by scikit-learn (Sklearn) for training our unsupervised ML models [44]. Tables 2, 3, 4, 5, 6 and 7 report the ARI, AMI, Homogeneity, Completeness, V-measure and Silhouette metrics for the various models trained on our research datasets.
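
The sketch below illustrates the general evaluation procedure described here; it is not the released code linked later in this section. The feature matrix X and label vector y are assumed from the data-preparation step, and the cluster counts are set to 2 purely for illustration, whereas the study reports using library defaults.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn import metrics

models = {
    "k-means": KMeans(n_clusters=2),
    "Mini Batch k-means": MiniBatchKMeans(n_clusters=2),
    "GMM": GaussianMixture(n_components=2),
    "BGM": BayesianGaussianMixture(n_components=2),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "DBSCAN": DBSCAN(),
}

for name, model in models.items():
    labels = model.fit_predict(X)  # X, y: features and labels prepared in Sect. 3
    print(name,
          metrics.adjusted_rand_score(y, labels),
          metrics.adjusted_mutual_info_score(y, labels),
          metrics.homogeneity_score(y, labels),
          metrics.completeness_score(y, labels),
          metrics.v_measure_score(y, labels),
          # Silhouette needs at least two distinct clusters (DBSCAN may return only noise).
          metrics.silhouette_score(X, labels) if len(set(labels)) > 1 else float("nan"))
```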

Table 2 Adjusted Rand Index (ARI) comparison among unsupervised machine learning models
Table 3 Adjusted Mutual Information (AMI) comparison among unsupervised machine learning models
Table 4 Homogeneity comparison among unsupervised machine learning models
Table 5 Completeness comparison among unsupervised machine learning models
Table 6 V-measure comparison among unsupervised machine learning models
Table 7 Silhouette comparison among unsupervised machine learning models

Table 2 illustrates the ARI across the 15 datasets. The best-performing method was Divisive clustering on D12 (0.8510), followed by BGM on D5 (0.6413). Performance varied widely across methods and datasets, underlining the necessity of testing multiple techniques. For AMI, the highest-performing model changed by dataset (Table 3). For instance, the best performance was observed with Divisive clustering on dataset D12 (0.7504), followed by BGM on dataset D5 (0.5337). For Homogeneity (Table 4), DBSCAN performed remarkably well on eight datasets, while Divisive clustering performed best on D1 and D12. Regarding Completeness (Table 5), DBSCAN performed best on seven datasets, while BGM and Divisive clustering showed strong results on three datasets. For the V-measure metric (Table 6), DBSCAN showed the best performance on eight datasets. For the Silhouette score (Table 7), Agglomerative clustering performed best on four datasets.

Additionally, Table 8 illustrates how often each model scored the highest in any given measure. DBSCAN showed the best performance most often (31 times), followed by BGM (18), Divisive clustering (15) and Agglomerative clustering (14). For individual performance metrics, DBSCAN is the top performer regarding Homogeneity, Completeness and V-measure, while BGM did well against the ARI and AMI metrics. The k-means-based methods (classic and Mini Batch) showed the weakest performance.

Table 8 Comparison of unsupervised machine learning models showing the number of times they presented the highest measurement

The best model to choose will depend on the particulars of a specific application and the performance indicators that matter most to the stakeholders. Overall, the DBSCAN model received the highest number of top scores across the 15 datasets, demonstrating the best overall performance. However, DBSCAN is sensitive to parameter settings and may struggle with clusters of varying densities, whereas Divisive clustering does not rely on specific parameter settings and handles clusters with different densities better. Additionally, unlike Divisive clustering, DBSCAN can face challenges in high-dimensional spaces and in preserving the global structure of the data. A critical observation is the wide variance in model performance across different datasets, even though DBSCAN dominated in most cases. This likely reflects innate differences in data distribution, noise, and feature relevance. This variation underscores the need for a sophisticated and discerning approach to choosing the appropriate unsupervised ML model, carefully weighing the dataset's unique properties alongside each model's inherent advantages.

Furthermore, the fact that DBSCAN consistently exhibits high Homogeneity, Completeness and V-measure implies that this model is particularly well-suited for datasets where classes are separated by density. This insight could prove invaluable for practitioners dealing with such data characteristics. Conversely, the strong performance of BGM in the ARI and AMI metrics across various datasets indicates its potential as a versatile model capable of capturing the structure of the data with a reasonable balance between cluster purity and recovery.

The Python code used to implement the unsupervised machine learning models considered in this study is available at https://github.com/haohuilu/unsupervisedml/.

5 Discussion

This research compares unsupervised machine learning models applied to 15 different health-related datasets. The datasets were sourced from the UCI Machine Learning Repository and Kaggle and encompass a variety of health issues, including heart disease, diabetes, and multiple forms of cancer. These datasets exhibit diverse numbers of features and sizes. The primary goal of this study was to contrast the performance of these models across multiple measures without undertaking any data preparation, ensuring an unbiased comparison.

The performance of the models differed based on the dataset and the evaluation metrics employed: ARI, AMI, Homogeneity, Completeness, V-measure, and Silhouette. Each metric provides a distinct insight into clustering quality. ARI and AMI measure the clustering against the ground truth. Homogeneity evaluates if each cluster solely comprises members of a single class, while Completeness assesses if all members of a specific class are grouped into the same cluster. V-measure is a harmonic mean of these two, and Silhouette gauges cluster separation and cohesion. DBSCAN’s excellent performance in Homogeneity suggests its robustness in capturing dense clusters, but it also flags potential shortcomings in handling data with varying densities or noise levels.

Meanwhile, the BGM model ranks second in overall performance across 15 datasets. It shows notable strength in ARI scores for five datasets and AMI scores for four. However, its high computational requirements might limit its use in large datasets or those needing immediate analysis. BGM models excel at autonomously determining cluster numbers in complex datasets and resist overfitting by integrating prior distributions. However, their high computational demand and reduced effectiveness with non-Gaussian data or inappropriate priors are notable drawbacks [45]. The third best performing model, Divisive clustering, designed for sequencing datasets like life-course histories, leverages Classification and Regression Tree analysis principles, including tree pruning, to predict cluster counts. It excels in hierarchical, large datasets by uncovering complex relationships but can struggle with overlapping or non-hierarchical data, leading to less accurate clustering [46]. Moreover, the consistent performance of Agglomerative Clustering in terms of the Silhouette score suggests its potential utility in datasets where clear separation between clusters is present. Nevertheless, Mini Batch k-means offers an alternative that might better manage noise while sacrificing some degree of performance due to its inherent randomness.

The selection of models in unsupervised learning tasks is nuanced and contextual. For instance, while hierarchical methods like agglomerative and divisive clustering do not require the specification of the number of clusters, their computational intensity and potential to create unbalanced hierarchies must be considered, especially for large datasets. In the literature, the application of unsupervised machine learning models in disease prediction must be judicious, considering the unique characteristics of healthcare data. For example, k-means is known for its efficiency and has been widely used in medical data analysis for its simplicity [47]. However, its performance can be hindered by the requirement to specify the number of clusters and its sensitivity to outliers [48]. DBSCAN is favoured for its ability to find clusters of arbitrary shapes and sizes, which is often suitable for the complex patterns present in medical datasets [49]. Yet, its performance can degrade with varying density clusters. The Gaussian Mixture Model offers flexibility due to its probabilistic nature and can accommodate the varied distribution of medical data [20], though it can be computationally intensive, which may not be optimal for all applications. Experts agree that there is no one-size-fits-all model, and the choice should depend on the specific requirements of the data and the task at hand [50].

To sum up, while DBSCAN frequently emerged as the top performer, no singular model consistently outshone others across every dataset and metric. The choice of model should be influenced by the unique attributes of the dataset and the relevance of the evaluation metrics for the particular research or application context. This study serves as a valuable reference for future unsupervised learning endeavours in health-related fields. It also emphasises the importance of continued exploration in model selection and optimisation techniques. The basic principles, pros and cons of various unsupervised models are detailed in Table 9.

Table 9 Basic principles, Pros and Cons of different unsupervised machine learning models

6 Conclusion

This study comprehensively compared unsupervised learning models within the realm of disease prediction. The diversity of data types within this field, from heart disease to prostate cancer, demands a flexible approach to model selection. Based on the evaluated performance metrics, two models emerged as particularly promising: DBSCAN and BGM. The former demonstrated robust performance on the Homogeneity and V-measure metrics, underscoring DBSCAN’s aptitude for discerning densely populated clusters of similarity, even across heterogeneous datasets. Conversely, BGM excelled on the ARI and AMI metrics. Such findings highlight the potential prowess of these models in disease prediction. Their consistently high performance across diverse datasets indicates their capability to transcend the inherent challenges posed by the varied scales and ranges typical of medical data. Despite the intricate nature of medical datasets, these models succeeded in effectively clustering the data. The findings from this study serve not only as a testament to the capabilities of these models in overcoming the challenges posed by medical datasets but also as a caveat to users to be mindful of the models’ limitations. Future research could explore the application of deep learning models for predicting disease risks, drawing from an even broader pool of medical datasets. One of the most noteworthy attributes of unsupervised machine learning models is their flexible architecture, which facilitates adaptability and continuous enhancement. It is important to note that unsupervised learning is an evolving domain, and ongoing advancements in algorithm efficiency, model robustness and interpretability are expected to further enhance their application in disease prediction and other medical applications.