Abstract
Purpose
Disease risk prediction poses a significant and growing challenge in the medical field. While researchers have increasingly utilised machine learning (ML) algorithms to tackle this issue, supervised ML methods remain dominant. However, interest in unsupervised techniques is rising, especially in situations where data labels are missing, as with undiagnosed or rare diseases. This study compares unsupervised ML models for disease prediction.
Methods
This study evaluated the efficacy of seven unsupervised algorithms on 15 datasets, including those of heart failure, diabetes, and breast cancer. Six performance metrics were used for the comparison: Adjusted Rand Index, Adjusted Mutual Information, Homogeneity, Completeness, V-measure and Silhouette Coefficient.
Results
Among the seven unsupervised ML methods, DBSCAN (Density-based spatial clustering of applications with noise) showed the best performance most often (31 times), followed by the Bayesian Gaussian Mixture (18) and Divisive clustering (15). No single model consistently outshone the others across every dataset and metric. The study emphasises the crucial role of selecting models and performance measures according to application-specific needs. For example, DBSCAN excels in the Homogeneity, Completeness and V-measure metrics, whereas the Bayesian Gaussian Mixture performs well on the Adjusted Rand Index. The code used in this study is available at https://github.com/haohuilu/unsupervisedml/.
Conclusion
This research contributes deeper insights into the unsupervised ML applications in healthcare and encourages further investigations into model selection. Subsequent studies could harness genuine disease records for a more nuanced comparison and evaluation of models.
1 Introduction
Machine learning (ML), a subfield of artificial intelligence, leverages computational methods to address challenges using historical data and information without requiring significant alterations to the fundamental process [1]. ML algorithms have diverse applications, such as automated text classification [2], project analytics [3], spam email filtering [4], marketing analytics [5], and disease prediction [6]. They fall primarily into two categories: supervised learning and unsupervised learning, with some researchers also recognising reinforcement learning, in which algorithms learn data patterns to respond to specific environments. Nevertheless, supervised and unsupervised learning are the most recognised types. The critical difference between the two categories lies in the presence of labels within the training data [1]. Supervised ML relies on labelled data: the dataset includes input features and corresponding output labels, allowing the algorithm to learn a mapping function that makes predictions for test or unseen data [7]. In contrast, unsupervised ML deals with unlabelled data: the dataset consists only of input features, with no output labels. The method discovers patterns or clusters autonomously, without direct instruction [8].
The data science research community has recently shown an amplified interest in medical informatics, with disease prediction being a key area of focus [9]. Disease prediction plays a critical role in modern healthcare: it allows for early treatment and improves patient outcomes. ML is a robust tool for predicting disease risk within intricate health data, as ML methods can learn from past data to predict future disease risks. Many studies have compared the performance of supervised ML in the disease prediction domain [10,11,12,13,14].
Nonetheless, comparative studies of unsupervised ML in the disease prediction domain remain limited, as it has not gained as much popularity as supervised ML [9]. Data labels are not always available, particularly for patients with undiagnosed or rare diseases. Vats et al. [15] compared unsupervised ML techniques for liver disease prediction, employing DBSCAN (Density-based spatial clustering of applications with noise), k-means, and Affinity Propagation to compare prediction accuracy and computational complexity. Antony et al. [16] proposed a framework that compares different unsupervised ML methods for chronic kidney disease prediction. Alashwal et al. [17] investigated various unsupervised methods for Alzheimer's prediction, aiming to identify suitable techniques for patient grouping and their potential impact on treatment. Our review uncovered a gap in the literature: a lack of thorough comparative studies of unsupervised learning algorithms across various types of disease prediction. As such, this research aims to evaluate the performance of different unsupervised ML algorithms in predicting diseases. It uses a variety of conditions, including heart failure, diabetes, and breast cancer, and employs unsupervised ML techniques such as k-means, DBSCAN and Agglomerative clustering for disease prediction. The objective is to compare predictive performance using several performance measures, such as the Silhouette Coefficient, Adjusted Mutual Information, Adjusted Rand Index, and V-measure. These measures are crucial in identifying the most effective approach for handling different datasets with numerous parameters. The key contributions of this research include:

Comprehensive analysis and comparison of various unsupervised ML algorithms for disease risk prediction, using diverse benchmark datasets and performance measures.

Identification of the top-performing unsupervised ML method for healthcare researchers and stakeholders, which will help them select suitable techniques for enhanced disease risk prediction.
2 Methods
ML algorithms are primarily categorised into supervised and unsupervised learning based on the presence or absence of labels within the given data. Supervised learning uses labelled data, while unsupervised learning uses unlabelled data to discover patterns or clusters. This study focuses on four families of unsupervised learning methods in the disease prediction domain: partitioning clustering, model-based clustering, hierarchical clustering and density-based clustering.
2.1 Unsupervised machine learning algorithms
Unsupervised ML, also known as clustering, involves grouping data into clusters based on the similarity of objects within the same cluster while ensuring that they are dissimilar to objects in other clusters [8]. Clustering is a type of unsupervised classification since there are no predefined classes.
Figure 1 shows how unsupervised ML techniques classify three groups in a twodimensional dataset. The dataset consists of 100 randomly generated data points divided into three groups based on their similarity. Different colours represent the clusters. On the scatter plot, the clusters are represented by different circles, and circular bounds have been placed around each cluster to visualise their boundaries better.
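A setup like Figure 1 can be sketched in a few lines, assuming scikit-learn's `make_blobs` as a stand-in for the paper's randomly generated points:

```python
# A minimal sketch of the data behind Figure 1 (make_blobs is an
# assumption; the paper does not state how its points were generated).
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 100 random two-dimensional points drawn around three centres.
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Group the points into three clusters by similarity (distance);
# the cluster labels play the role of the colours in the figure.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
```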
2.1.1 Partitioning clustering
Partitioning clustering requires the analyst to specify the number of clusters to be generated. The k-means algorithm is the most widely used partitioning clustering method [18]. Figure 2 demonstrates the steps of the standard k-means algorithm. The first step selects \(k\) points as the initial centroids. Each data point is then classified according to its distance to the centroids of the \(k\) clusters. Next, the centroid of each cluster is recomputed from its assigned points, and these steps repeat until the centroids no longer change. This study uses two popular k-means variants: classic k-means and Mini Batch k-means [19].
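The loop described above can be sketched in a few lines of NumPy (a simplified illustration, not the scikit-learn implementation used in the experiments):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following the steps above: choose k initial
    centroids, assign points to the nearest centroid, recompute the
    centroids, and repeat until they stop changing."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Classify each point by its distance to the k centroids.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid from the points assigned to it
        # (keeping the old centroid if a cluster ends up empty).
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):  # stop when centroids are unchanged
            break
        centroids = new
    return labels, centroids
```

Mini Batch k-means follows the same loop but updates the centroids from small random samples of the data rather than the full dataset.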
2.1.2 Model-based clustering
Model-based clustering is another unsupervised ML method. It is a probabilistic approach to clustering that uses Gaussian Mixture Models (GMMs) to represent data as a mixture of Gaussian distributions [20]. A GMM attempts to fit a dataset to a combination of different Gaussian distributions. It evaluates the likelihood of each data point belonging to each cluster, as opposed to classic k-means clustering, which allocates each data point to a single cluster. This enables a more flexible and accurate representation of data distributions, particularly when dealing with overlapping or non-spherical clusters [20]. Figure 3 shows a GMM with four components fitted to the data, with the resulting clusters coloured. The GMM's Gaussian distributions are shown as ellipses, illustrating each distribution's spread and direction and the probabilistic character of the clustering process. We also use the Bayesian Gaussian Mixture (BGM) [21] in the performance comparison.
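A minimal sketch of fitting a mixture model with scikit-learn (the four components mirror Figure 3; the synthetic data are illustrative, not the study's datasets):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# Fit a four-component GMM and read off hard cluster assignments.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
labels = gmm.predict(X)

# Unlike k-means, the model also gives soft assignments: one row of
# membership probabilities per point, summing to 1.
probs = gmm.predict_proba(X)

# The Bayesian variant used in the comparison (BGM) is fitted the same way.
bgm_labels = BayesianGaussianMixture(n_components=4, random_state=0).fit_predict(X)
```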
2.1.3 Hierarchical clustering
Hierarchical clustering generates a group of nested clusters arranged in a hierarchical tree structure. This can be represented through a dendrogram, a tree-like diagram that records the sequence of merges or splits [22]. Figure 4 shows an example of a dendrogram for hierarchical clustering. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each point as its own cluster. As the process progresses, the nearest pair of clusters is merged at each step. This merging continues until a single cluster remains or a specified number of clusters is reached, depending on the parameters set at the outset [22]. Divisive clustering, on the other hand, begins with a single, all-encompassing cluster. As the process evolves, a cluster is split at each step. This splitting continues until every cluster contains only an individual point or a predetermined number of clusters is achieved, depending on the initial setup [22]. This study uses both Agglomerative clustering and Divisive clustering for comparison.
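Agglomerative clustering and the dendrogram's merge sequence can be sketched with SciPy; divisive clustering has no standard SciPy or scikit-learn implementation, so only the agglomerative direction is shown:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(6, 0.5, (10, 2))])

# Each point starts as its own cluster; Ward linkage merges the nearest
# pair of clusters at every step until one cluster remains. Each row of
# Z records one merge, which is exactly what a dendrogram draws.
Z = linkage(X, method="ward")

# Cut the tree to recover a chosen number of clusters (two here).
labels = fcluster(Z, t=2, criterion="maxclust")
```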
2.1.4 Density-based clustering
The density-based method relies on density as the local cluster criterion, such as points connected by density. Density-based clustering can identify clusters of any shape and effectively handles noise within the data. It requires only a single scan, examining the local region to validate the density, but it necessitates the specification of density parameters as a termination condition [22]. Density-based spatial clustering of applications with noise (DBSCAN) is a well-known example of a density-based method [23]. It labels high-density areas as clusters and low-density areas as outliers, helping discover clusters of varied forms and deal with noise without requiring a preset number of clusters [23]. Figure 5 visualises the clusters produced by the DBSCAN method.
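A minimal DBSCAN sketch on non-spherical data; the `eps` and `min_samples` values are illustrative and would need tuning per dataset:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved crescents: clusters of arbitrary shape that
# centroid-based methods such as k-means handle poorly.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points in low-density regions are labelled -1 (noise); the remaining
# labels are the discovered clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that the number of clusters (two here) is an output of the algorithm, not an input, unlike k-means.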
2.2 Performance comparison measures
The performance of the various unsupervised ML methods is assessed using six evaluation measures: Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, Completeness, V-measure, and Silhouette Coefficient. These establish the comparative performance metrics in this study.
2.2.1 Adjusted Rand Index
Adjusted Rand Index (ARI) is a modification of the Rand Index. It calculates a similarity metric between two clusterings by considering all sample pairs and counting those pairs that are assigned similarly or differently in the predicted and actual clusterings [24]. The formula for ARI is

$$ARI = \frac{RI - Expected\,RI}{Max\,(RI) - Expected\,RI}$$
where \(RI\) is the Rand Index, \(Expected \,RI\) is the Expected Rand Index, and \(Max\,(RI)\) is the maximum possible Rand Index.
The ARI value lies between -1 and 1, where 1 means identical clusterings and -1 means completely dissimilar clusterings. An ARI of 0 indicates random labelling.
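A small sketch using scikit-learn's implementation; note that ARI is invariant to permuting the cluster labels:

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 1, 1, 2, 2]

# Identical clustering scores 1, and renaming the clusters changes nothing.
assert adjusted_rand_score(truth, truth) == 1.0
assert adjusted_rand_score(truth, [2, 2, 0, 0, 1, 1]) == 1.0

# An imperfect clustering falls below 1.
score = adjusted_rand_score(truth, [0, 0, 1, 2, 2, 2])
assert -1.0 <= score < 1.0
```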
2.2.2 Adjusted Mutual Information
Adjusted Mutual Information (AMI) modifies the Mutual Information (MI) score to account for chance [25]. It acknowledges that MI tends to increase with larger numbers of clusters, independent of the actual amount of shared information between them. The formula for AMI is

$$AMI = \frac{MI - Expected\,MI}{\mathrm{mean}\left(H(X), H(Y)\right) - Expected\,MI}$$
where \(MI\) is the mutual information, \(H(X)\) and \(H(Y)\) are the entropies of X and Y, and \(Expected \, MI\) is the expected mutual information. AMI ranges from 0 to 1, where a score of 1 indicates perfect agreement between two clustering. A score close to 0 suggests largely independent clustering or a result no better than random chance.
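A small illustrative sketch with scikit-learn's implementation, showing the chance correction on a labelling that splits each true class into pure sub-clusters:

```python
from sklearn.metrics import adjusted_mutual_info_score

truth = [0, 0, 0, 1, 1, 1]
# Each true class is split across extra clusters: raw MI would reward
# the extra clusters, while AMI corrects for that chance effect.
split = [0, 0, 1, 2, 2, 3]

ami = adjusted_mutual_info_score(truth, split)
assert adjusted_mutual_info_score(truth, truth) == 1.0  # perfect agreement
assert ami < 1.0
```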
2.2.3 Homogeneity
Homogeneity is a clustering measure that compares the outcome to a ground truth. A cluster is homogeneous if it is made up entirely of data points from a single class [26]. The formula for Homogeneity is

$$Homogeneity = 1 - \frac{H({Y}_{true} \mid {Y}_{predict})}{H({Y}_{true})}$$
where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clusters. \(H({Y}_{true} \mid {Y}_{predict})\) is the conditional entropy of the ground truth given the cluster predictions, and \(H({Y}_{true})\) is the entropy of the ground truth. Homogeneity also varies between 0 and 1. A score of 1 means that each cluster contains only members of a single class, signifying perfect Homogeneity. A score of 0 indicates that the clusters are randomly assigned, lacking any homogeneity.
2.2.4 Completeness
Completeness is another clustering evaluation metric, determining whether all data points of a given class are assigned to the same cluster. The clustering result is deemed complete when each class is contained within a single cluster [26]. The formula for this measure is

$$Completeness = 1 - \frac{H({Y}_{predict} \mid {Y}_{true})}{H({Y}_{predict})}$$
where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clusters. \(H({Y}_{predict} \mid {Y}_{true})\) is the conditional entropy of the cluster predictions given the ground truth, and \(H({Y}_{predict})\) is the entropy of the cluster predictions. Completeness ranges from 0 to 1. A score of 1 is achieved when all members of each class are assigned to the same cluster, indicating complete capture of all classes within the clusters. A score of 0 implies that the clustering assignments are completely scattered, failing to capture the classes.
2.2.5 Vmeasure
The V-measure is the harmonic mean of Homogeneity and Completeness [26], and the formula is

$$V\text{-}measure = \frac{2 \times Homogeneity \times Completeness}{Homogeneity + Completeness}$$
The V-measure ranges from 0 to 1. A score of 1 represents perfect clustering, with both complete capture of all classes within clusters and each cluster containing only members of a single class. A score of 0 indicates that the clustering fails on both homogeneity and completeness grounds.
2.2.6 Silhouette coefficient
The Silhouette Coefficient is used in cluster analysis to assess clustering quality. It compares the distance between each data point and the points in its own cluster with the distance to points in neighbouring clusters, measuring how well each data point fits into its allocated cluster [27]. For a data point \(i\), the formula is

$$s_i = \frac{b_i - a_i}{\max(a_i, b_i)}$$
where \({a}_{i}\) is the average distance from point \(i\) to the other points in its own cluster, and \({b}_{i}\) is the average distance from point \(i\) to the points in the nearest other cluster. Silhouette Coefficient values range from -1 to 1. A score of 1 denotes that the clusters are well apart from each other and clearly distinguished. A score of 0 indicates overlapping clusters. A negative value suggests that data points might have been assigned to the wrong clusters. This metric gives a perspective on the distance and separation between the formed clusters.
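A short sketch of the silhouette on synthetic data (the blobs and k-means labels are illustrative):

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Unlike the five measures above, the silhouette needs no ground truth:
# it is computed from the data and the predicted labels alone.
score = silhouette_score(X, labels)
assert -1.0 <= score <= 1.0
```

This ground-truth independence makes the silhouette the only measure in the set that is usable when no labels exist at all.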
3 Research dataset
Table 1 presents the datasets utilised in this study, outlining their respective attributes, including the number of features and data size. These datasets were sourced from the UCI Machine Learning Repository [28] and Kaggle [29]. To ensure an unbiased comparison, this research uses the original data without preprocessing, apart from dropping entries with missing values.
4 Results
We employed the default parameters provided by scikit-learn for training our unsupervised ML models [44]. Tables 2, 3, 4, 5, 6 and 7 show the ARI, AMI, Homogeneity, Completeness, V-measure and Silhouette metrics of the various models trained on our research datasets.
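The evaluation can be sketched as a loop over models and measures. Here scikit-learn's bundled breast-cancer data stands in for the study's datasets, defaults are kept except for the number of clusters/components, and Divisive clustering is omitted because scikit-learn provides no implementation of it:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.cluster import KMeans, MiniBatchKMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn import metrics

X, y = load_breast_cancer(return_X_y=True)
k = len(set(y))  # number of ground-truth classes

models = {
    "k-means": KMeans(n_clusters=k, n_init=10),
    "Mini Batch k-means": MiniBatchKMeans(n_clusters=k, n_init=3),
    "Agglomerative": AgglomerativeClustering(n_clusters=k),
    "GMM": GaussianMixture(n_components=k, random_state=0),
    "BGM": BayesianGaussianMixture(n_components=k, random_state=0),
    "DBSCAN": DBSCAN(),
}

results = {}
for name, model in models.items():
    labels = model.fit_predict(X)
    results[name] = {
        "ARI": metrics.adjusted_rand_score(y, labels),
        "AMI": metrics.adjusted_mutual_info_score(y, labels),
        "Homogeneity": metrics.homogeneity_score(y, labels),
        "Completeness": metrics.completeness_score(y, labels),
        "V-measure": metrics.v_measure_score(y, labels),
        # The silhouette is undefined when a model yields a single label.
        "Silhouette": (metrics.silhouette_score(X, labels)
                       if len(set(labels)) > 1 else float("nan")),
    }
```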
Table 2 illustrates the ARI across the 15 datasets. The best-performing method is Divisive clustering on D12 (0.8510), followed by BGM on D5 (0.6413). Performance varied widely across methods and datasets, underlining the necessity of testing multiple techniques. For AMI, the highest-performing model changed by dataset (Table 3): the best performance was observed with Divisive clustering on dataset D12 (0.7504), followed by BGM on dataset D5 (0.5337). For Homogeneity (Table 4), DBSCAN performs remarkably well on eight datasets, while Divisive clustering performs best on D1 and D12. Regarding Completeness (Table 5), DBSCAN performs best on seven datasets, while BGM and Divisive clustering showed strong results on three datasets. DBSCAN showed the best performance on eight datasets for the V-measure metric. Evaluating the Silhouette score, Agglomerative clustering dominated four datasets.
Additionally, Table 8 illustrates how often each model scored the highest on any given measure. DBSCAN showed the best performance most often (31 times), followed by BGM (18), Divisive clustering (15) and Agglomerative clustering (14). For individual performance metrics, DBSCAN is the top performer regarding Homogeneity, Completeness and V-measure, while BGM did well on the ARI and AMI metrics. The k-means-based methods (Classic and Mini Batch) showed the weakest performance.
The best model to choose depends on the particulars of a specific application and the performance indicators that matter most to the stakeholders. As shown above, the DBSCAN model received the highest score across the 15 datasets, demonstrating the best overall performance. However, DBSCAN is sensitive to parameter settings and may struggle with clusters of varying densities, whereas Divisive clustering does not rely on specific parameter settings and is better at handling clusters with different densities. Additionally, unlike Divisive clustering, DBSCAN can face challenges in high-dimensional spaces and in preserving the global structure of data. A critical observation is the wide variance in model performance across different datasets, although DBSCAN dominated in most cases. This could reflect innate differences in data distribution, noise, and feature relevance. This variation underscores the need for a sophisticated and discerning approach to choosing the appropriate unsupervised ML model, carefully weighing the dataset's unique properties alongside each model's inherent advantages.
Furthermore, the fact that DBSCAN consistently exhibits high Homogeneity, Completeness and V-measure implies that this model is particularly well-suited for datasets where classes are separated by density. This insight could prove invaluable for practitioners dealing with such data characteristics. Conversely, the strong performance of BGM on the ARI and AMI metrics across various datasets indicates its potential as a versatile model capable of capturing the structure of the data with a reasonable balance between cluster purity and recovery.
The Python code used to implement the unsupervised machine learning models considered in this study is available at https://github.com/haohuilu/unsupervisedml/.
5 Discussion
This research compares unsupervised machine learning models applied to 15 different health-related datasets. The datasets were sourced from the UCI Machine Learning Repository and Kaggle and encompass a variety of health issues, including heart disease, diabetes, and multiple forms of cancer. These datasets exhibit diverse numbers of features and sizes. The primary goal of this study was to contrast the performance of these models across multiple measures without undertaking any data preparation, ensuring an unbiased comparison.
The performance of the models differed based on the dataset and the evaluation metrics employed: ARI, AMI, Homogeneity, Completeness, V-measure, and Silhouette. Each metric provides a distinct insight into clustering quality. ARI and AMI measure the clustering against the ground truth. Homogeneity evaluates whether each cluster solely comprises members of a single class, while Completeness assesses whether all members of a specific class are grouped into the same cluster. V-measure is the harmonic mean of these two, and Silhouette gauges cluster separation and cohesion. DBSCAN's excellent performance in Homogeneity suggests its robustness in capturing dense clusters, but it also flags potential shortcomings in handling data with varying densities or noise levels.
Meanwhile, the BGM model ranks second in overall performance across the 15 datasets. It shows notable strength in ARI scores for five datasets and AMI scores for four. However, its high computational requirements might limit its use in large datasets or those needing immediate analysis. BGM models excel at autonomously determining cluster numbers in complex datasets and resist overfitting by integrating prior distributions; however, their high computational demand and reduced effectiveness with non-Gaussian data or inappropriate priors are notable drawbacks [45]. The third best-performing model, Divisive clustering, designed for sequencing datasets such as life-course histories, leverages Classification and Regression Tree analysis principles, including tree pruning, to predict cluster counts. It excels in hierarchical, large datasets by uncovering complex relationships but can struggle with overlapping or non-hierarchical data, leading to less accurate clustering [46]. Moreover, the consistent performance of Agglomerative clustering in terms of the Silhouette score suggests its potential utility in datasets where clear separation between clusters is present. Nevertheless, Mini Batch k-means offers an alternative that might better manage noise while sacrificing some degree of performance due to its inherent randomness.
The selection of models in unsupervised learning tasks is nuanced and contextual. For instance, while hierarchical methods like agglomerative and divisive clustering do not require the number of clusters to be specified, their computational intensity and potential to create unbalanced hierarchies must be considered, especially for large datasets. In the literature, the application of unsupervised machine learning models in disease prediction must be judicious, considering the unique characteristics of healthcare data. For example, k-means is known for its efficiency and has been widely used in medical data analysis for its simplicity [47]. However, its performance can be hindered by the requirement to specify the number of clusters and its sensitivity to outliers [48]. DBSCAN is favoured for its ability to find clusters of arbitrary shapes and sizes, which suits the complex patterns present in medical datasets [49], yet its performance can degrade with varying-density clusters. The Gaussian Mixture Model offers flexibility due to its probabilistic nature and can accommodate the varied distribution of medical data [20], though it can be computationally intensive, which may not be optimal for all applications. Experts agree that there is no one-size-fits-all model, and the choice should depend on the specific requirements of the data and the task at hand [50].
To sum up, while DBSCAN frequently emerged as the top performer, no singular model consistently outshone others across every dataset and metric. The choice of model should be influenced by the unique attributes of the dataset and the relevance of the evaluation metrics for the particular research or application context. This study serves as a valuable reference for future unsupervised learning endeavours in healthrelated fields. It also emphasises the importance of continued exploration in model selection and optimisation techniques. The basic principles, pros and cons of various unsupervised models are detailed in Table 9.
6 Conclusion
This study comprehensively compared unsupervised learning models within the realm of disease prediction. The diversity of data types within this field, from heart disease to prostate cancer, demands a flexible approach to model selection. Based on the evaluated performance metrics, two models emerged as particularly promising: DBSCAN and BGM. The former demonstrated robust performance on the Homogeneity and V-measure metrics, while BGM excelled on the ARI and AMI metrics. This underscores DBSCAN's aptitude for discerning densely populated clusters of similarity, even across heterogeneous datasets. Such findings highlight the potential of these models in disease prediction. Their consistently high performance across diverse datasets indicates their capability to transcend the inherent challenges posed by the varied scales and ranges typical of medical data. Despite the intricate nature of medical datasets, these models succeeded in effectively clustering the data. The findings from this study serve not only as a testament to the capabilities of these models but also as a caveat to users to be mindful of the models' limitations. Future research directions could delve into applying deep learning models for predicting disease risks, drawing from an even broader pool of medical datasets. One of the most noteworthy attributes of unsupervised machine learning models is their flexible architecture, which facilitates adaptability and continuous enhancement. Unsupervised learning is an evolving domain, and ongoing advancements in algorithm efficiency, model robustness and interpretability are expected to further enhance its application in disease prediction and other medical settings.
Data availability statement
This research used open-access public datasets for its investigation.
References
Alloghani M, Al-Jumeily D, Mustafina J, Hussain A, Aljaaf AJ. A systematic review on supervised and unsupervised machine learning algorithms for data science. In: Supervised and unsupervised learning for data science. Springer; 2020. p. 3–21.
Chen H, Wu L, Chen J, Lu W, Ding J. A comparative study of automated legal text classification using random forests and deep learning. Inf Process Manage. 2022;59(2):102798.
Uddin S, Ong S, Lu H. Machine learning in project analytics: a data-driven framework and case study. Sci Rep. 2022;12(1):15252.
Jáñez-Martino F, Alaiz-Rodríguez R, González-Castro V, Fidalgo E, Alegre E. A review of spam email detection: analysis of spammer strategies and the dataset shift problem. Artif Intell Rev. 2023;56(2):1145–73.
Miklosik A, Evans N. Impact of big data and machine learning on digital transformation in marketing: a literature review. IEEE Access. 2020;8:101284–92.
Lu H, Uddin S. A disease networkbased recommender system framework for predictive risk modelling of chronic diseases and their comorbidities. Appl Intell. 2022;52(9):10330–40.
Singh A, Thakur N, Sharma A. A review of supervised machine learning algorithms. In: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom). IEEE; 2016.
Hahne F, Huber W, Gentleman R, Falcon S, Gentleman R, Carey V. Unsupervised machine learning. In: Bioconductor case studies. Springer; 2008. p. 137–57.
Uddin S, Khan A, Hossain ME, Moni MA. Comparing different supervised machine learning algorithms for disease prediction. BMC Med Inform Decis Mak. 2019;19(1):281.
Katarya R, Meena SK. Machine learning techniques for heart disease prediction: a comparative study and analysis. Heal Technol. 2021;11:87–97.
Rahman AS, Shamrat FJM, Tasnim Z, Roy J, Hossain SA. A comparative study on liver disease prediction using supervised machine learning algorithms. Int J Sci Technol Res. 2019;8(11):419–22.
Shamrat FJM, Asaduzzaman M, Rahman AS, Tusher RTH, Tasnim Z. A comparative analysis of parkinson disease prediction using machine learning approaches. Int J Sci Technol Res. 2019;8(11):2576–80.
Sinha P, Sinha P. Comparative study of chronic kidney disease prediction using KNN and SVM. Int J Eng Res Technol. 2015;4(12):608–12.
Uddin S, Haque I, Lu H, Moni MA, Gide E. Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci Rep. 2022;12(1):1–11.
Vats V, Zhang L, Chatterjee S, Ahmed S, Enziama E, Tepe K. A comparative analysis of unsupervised machine learning techniques for liver disease prediction. In: 2018 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT). IEEE; 2018.
Antony L, Azam S, Ignatious E, Quadir R, Beeravolu AR, Jonkman M, De Boer F. A comprehensive unsupervised framework for chronic kidney disease prediction. IEEE Access. 2021;9:126481–501.
Alashwal H, El Halaby M, Crouse JJ, Abdalla A, Moustafa AA. The application of unsupervised clustering methods to Alzheimer’s disease. Front Comput Neurosci. 2019;13:31.
Hartigan JA, Wong MA. Algorithm AS 136: A k-means clustering algorithm. J R Stat Soc Ser C Appl Stat. 1979;28(1):100–8.
Sculley D. Web-scale k-means clustering. In: Proceedings of the 19th International Conference on World Wide Web. 2010.
Reynolds DA. Gaussian mixture models. In: Encyclopedia of biometrics, vol. 741. Springer; 2009. p. 659–63.
Roberts SJ, Husmeier D, Rezek I, Penny W. Bayesian approaches to Gaussian mixture modeling. IEEE Trans Pattern Anal Mach Intell. 1998;20(11):1133–42.
Han J, Pei J, Tong H. Data mining: concepts and techniques. Morgan Kaufmann; 2022.
Ester M, Kriegel HP, Sander J, Xu X. Density-based spatial clustering of applications with noise. In: Int. Conf. Knowledge Discovery and Data Mining; 1996.
Steinley D. Properties of the Hubert-Arabie adjusted Rand index. Psychol Methods. 2004;9(3):386.
Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11:2837–54.
Rosenberg A, Hirschberg J. V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL); 2007.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
Asuncion A, Newman D. UCI machine learning repository. Irvine, CA, USA; 2007.
Kaggle. Kaggle. 2023. www.kaggle.com. Cited 16 June 2023.
Detrano R, Janosi A, Steinbrunn W, Pfisterer M, Schmid JJ, Sandhu S, Guppy KH, Lee S, Froelicher V. International application of a new probability algorithm for the diagnosis of coronary artery disease. Am J Cardiol. 1989;64(5):304–10.
Chicco D, Jurman G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak. 2020;20(1):1–16.
Smith JW, Everhart JE, Dickson W, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the annual symposium on computer application in medical care. American Medical Informatics Association; 1988.
Mangasarian OL, Street WN, Wolberg WH. Breast cancer diagnosis and prognosis via linear programming. Oper Res. 1995;43(4):570–7.
Machmud R, Wijaya A. Behavior determinant based cervical cancer early detection with machine learning algorithm. Adv Sci Lett. 2016;22(10):3120–3.
Ramana BV, Babu MSP, Venkateswarlu N. A critical study of selected classification algorithms for liver disease diagnosis. Int J Database Manag Syst. 2011;3(2):101–14.
Hong ZQ, Yang JY. Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognit. 1991;24(4):317–24.
Quinlan R. Thyroid disease data set. 1987. https://archive.ics.uci.edu/ml/datasets/thyroid+disease. Accessed 3 Jul 2022.
Soundarapandian P, Rubini L, Eswaran P. Chronic kidney disease data set. Irvine, CA, USA: UCI Mach. Learn. Repository, School Inf. Comput. Sci., Univ. California; 2015.
Lichman M. UCI machine learning repository. Irvine, CA, USA; 2013.
Thabtah F, Kamalov F, Rajab K. A new computational intelligence approach to detect autistic features for autism screening. Int J Med Informatics. 2018;117:112–24.
Mahmood S. Prostate cancer. 2023. https://www.kaggle.com/datasets/sajidsaifi/prostatecancer. Accessed 15 June 2023.
Patrício M, Pereira J, Crisóstomo J, Matafome P, Gomes M, Seiça R, Caramelo F. Using Resistin, glucose, age and BMI to predict the presence of breast cancer. BMC Cancer. 2018;18(1):1–8.
Fernandes K, Cardoso JS, Fernandes J. Transfer learning with partial observability applied to cervical cancer screening. In: Pattern Recognition and Image Analysis: 8th Iberian Conference, IbPRIA 2017, Faro, Portugal, June 20–23, 2017, Proceedings 8. Springer; 2017.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
Moser G, Lee SH, Hayes BJ, Goddard ME, Wray NR, Visscher PM. Simultaneous discovery, estimation and prediction analysis of complex traits using a Bayesian mixture model. PLoS Genet. 2015;11(4):e1004969.
Chander S, Vijaya P. Unsupervised learning methods for data clustering. In: Artificial Intelligence in Data Mining. Elsevier; 2021. p. 41–64.
Jain AK. Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 2010;31(8):651–66.
Celebi ME, Kingravi HA, Vela PA. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl. 2013;40(1):200–10.
Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining (KDD); 1996.
Bouveyron C, Brunet-Saumard C. Model-based clustering of high-dimensional data: a review. Comput Stat Data Anal. 2014;71:52–78.
McLachlan GJ, Lee SX, Rathnayake SI. Finite mixture models. Annu Rev Stat Appl. 2019;6:355–78.
Ghahramani Z, Beal M. Variational inference for Bayesian mixtures of factor analysers. In: Advances in neural information processing systems, vol. 12. NeurIPS; 1999.
Ackermann MR, Blömer J, Kuntze D, Sohler C. Analysis of agglomerative clustering. Algorithmica. 2014;69:184–215.
Sonagara D, Badheka S. Comparison of basic clustering algorithms. Int J Comput Sci Mob Comput. 2014;3(10):58–61.
Khan K, Rehman SU, Aziz K, Fong S, Sarasvady S. DBSCAN: past, present and future. In: The fifth international conference on the applications of digital information and web technologies (ICADIWT 2014). IEEE; 2014.
Funding
Open Access funding enabled and organized by CAUL and its Member Institutions. This research received no specific grant from any public, commercial, or not-for-profit funding agency.
Author information
Contributions
H.L.: Writing, Data analysis and Research design; S.U.: Research design, Writing, Conceptualisation and Supervision.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Lu, H., Uddin, S. Unsupervised machine learning for disease prediction: a comparative performance analysis using multiple datasets. Health Technol. 14, 141–154 (2024). https://doi.org/10.1007/s12553-023-00805-8