1 Introduction

Machine learning (ML), a subfield of artificial intelligence, leverages computational methods to address challenges using historical data and information without requiring significant alterations to the fundamental process [1]. ML algorithms have diverse applications, such as automated text classification [2], project analytics [3], spam email filtering [4], marketing analytics [5], and disease prediction [6]. They fall primarily into two categories, supervised learning and unsupervised learning, with some researchers also acknowledging reinforcement learning algorithms that learn data patterns to respond to specific environments. Nevertheless, supervised learning and unsupervised learning are the most recognised types. The critical difference between these two categories lies in the existence of labels within the training data subset [1]. Supervised ML relies on labelled data: the dataset includes input features and corresponding output labels, allowing the algorithm to learn a mapping function and make predictions for test or unseen data [7]. In contrast, unsupervised ML deals with unlabelled data: the dataset consists only of input features without output labels. This method discovers patterns or clusters autonomously, without direct instructions [8].

The data science research community has recently shown an amplified interest in medical informatics, with disease prediction being a key area of focus [9]. Disease prediction plays a critical role in modern healthcare: it allows for early treatment and improves patient outcomes. ML is a robust tool for predicting disease risk within intricate health data, as ML methods can learn from past data to predict future disease risks. Many studies have compared the performance of supervised ML methods in the disease prediction domain [10,11,12,13,14].

Nonetheless, there are limited comparative studies on unsupervised ML in the disease prediction domain, as it has not gained as much popularity as supervised ML [9]. Data labels are not always available, particularly when patients have undiagnosed or rare diseases. Vats et al. [15] compared unsupervised ML techniques for liver disease prediction, employing DBSCAN (Density-based spatial clustering of applications with noise), k-means, and Affinity Propagation to compare prediction accuracy and computational complexity. Antony et al. [16] proposed a framework that compares different unsupervised ML methods for chronic kidney disease prediction. Alashwal et al. [17] investigated various unsupervised methods for Alzheimer’s prediction, aiming to identify suitable techniques for patient grouping and their potential impact on treatment. Our review uncovered a research gap: a lack of thorough comparative studies of unsupervised learning algorithms across various types of disease prediction. As such, this research aims to evaluate the performance of different unsupervised ML algorithms in predicting diseases. It covers a variety of conditions, including heart failure, diabetes, and breast cancer, and employs unsupervised ML techniques such as k-means, DBSCAN, and Agglomerative Clustering for disease prediction. The objective is to compare predictive performance using several performance measures, such as the Silhouette coefficient, Adjusted Mutual Information, Adjusted Rand Index, and V-measure. These measures are crucial in identifying the most effective approach for handling different datasets with numerous parameters. The key contributions of this research include:

  • Comprehensive analysis and comparison of various unsupervised ML algorithms for disease risk prediction, using diverse benchmark datasets and performance measures.

  • Identification of the top-performing unsupervised ML method for healthcare researchers and stakeholders, which will ultimately help in selecting suitable techniques for enhanced disease risk prediction.

2 Methods

ML algorithms are primarily categorised into supervised and unsupervised learning based on the presence or absence of labels within the given data. Supervised learning uses labelled data, while unsupervised learning uses unlabelled data to discover patterns or clusters. This study focuses on different unsupervised learning methods in the disease prediction domain: partitioning clustering, model-based clustering, hierarchical clustering, and density-based clustering.

2.1 Unsupervised machine learning algorithms

Unsupervised ML, also known as clustering, involves grouping data into clusters based on the similarity of objects within the same cluster while ensuring that they are dissimilar to objects in other clusters [8]. Clustering is a type of unsupervised classification since there are no predefined classes.

Figure 1 shows how an unsupervised ML technique identifies three groups in a two-dimensional dataset. The dataset consists of 100 randomly generated data points divided into three groups based on their similarity. Different colours represent the clusters, and circular bounds have been drawn around each cluster on the scatter plot to better visualise their boundaries.

Fig. 1 An example of unsupervised learning

2.1.1 Partitioning clustering

Partitioning clustering requires the analyst to specify the number of clusters to be generated. The k-means clustering algorithm is the most widely used partitioning method [18]. Figure 2 demonstrates the steps of the standard k-means clustering algorithm. The first step selects \(k\) points as the initial centroids. Each data point is then assigned to the nearest of the \(k\) centroids. The centroid of each cluster is then recomputed from its assigned points, and these steps are repeated until the centroids no longer change. This study uses two popular k-means variants: classic k-means and Mini Batch k-means [19].

Fig. 2 Demonstration of the standard k-means clustering algorithm
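
To make the procedure concrete, the following minimal sketch runs both k-means variants with scikit-learn; the synthetic data and parameter values are chosen purely for illustration and are not taken from the study.

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic two-dimensional data with three groups (illustrative only)
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)

# Classic k-means: assign each point to the nearest centroid, recompute
# centroids from the assigned points, and repeat until the centroids stabilise.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Mini Batch k-means: optimises the same objective using small random batches,
# trading some accuracy for speed on larger datasets.
mbk_labels = MiniBatchKMeans(n_clusters=3, random_state=42).fit_predict(X)
```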

2.1.2 Model-based clustering

Model-based clustering is another unsupervised ML method. It is a probabilistic approach to clustering that uses Gaussian Mixture Models (GMMs) to represent data as a mixture of Gaussian distributions [20]. A GMM attempts to fit the dataset with a combination of Gaussian distributions and evaluates the likelihood of each data point belonging to each cluster, as opposed to classic k-means clustering, which allocates each data point to a single cluster. This enables a more flexible and accurate representation of data distributions, particularly when dealing with overlapping or non-spherical clusters [20]. Figure 3 shows a GMM with four components fitted to the data, with the resulting clusters coloured. The Gaussian components are shown as ellipses, illustrating each distribution’s spread and orientation and the probabilistic character of the clustering process. We also use the Bayesian Gaussian Mixture (BGM) [21] for performance comparison.

Fig. 3 Demonstration of the Gaussian Mixture Model
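
As a minimal sketch of model-based clustering with scikit-learn (the synthetic data and component counts are illustrative, not those used in the study):

```python
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture
from sklearn.datasets import make_blobs

# Synthetic data with four groups (illustrative only)
X, _ = make_blobs(n_samples=100, centers=4, random_state=0)

# GMM with four components: each point gets a probability of belonging to
# each Gaussian component (soft assignment), unlike k-means' hard assignment.
gmm = GaussianMixture(n_components=4, random_state=0).fit(X)
hard_labels = gmm.predict(X)             # most probable component per point
soft_memberships = gmm.predict_proba(X)  # per-component membership probabilities

# Bayesian Gaussian Mixture: places priors on the mixture weights, which lets
# the model effectively switch off components it does not need.
bgm_labels = BayesianGaussianMixture(n_components=4, random_state=0).fit_predict(X)
```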

2.1.3 Hierarchical clustering

Hierarchical clustering generates a set of nested clusters arranged in a hierarchical tree structure. This can be represented by a dendrogram, a tree-like diagram that records the sequence of merges or splits [22]. Figure 4 shows an example of a dendrogram for hierarchical clustering. There are two main types of hierarchical clustering: agglomerative and divisive. Agglomerative clustering starts with each point as its own cluster. At each step, the nearest pair of clusters is merged, and this continues until a single cluster remains or a specified number of clusters is reached, depending on the parameters set at the outset [22]. Divisive clustering, on the other hand, begins with a single, all-encompassing cluster. At each step, a cluster is split, and this continues until every cluster contains only an individual point or a predetermined number of clusters is reached, depending on the initial setup [22]. This study uses both Agglomerative clustering and Divisive clustering for comparison.

Fig. 4 Dendrogram for Hierarchical Clustering
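
The sketch below shows the agglomerative variant with scikit-learn, plus a SciPy dendrogram of the merge sequence. Note that scikit-learn does not provide a divisive algorithm, so this sketch covers only the bottom-up case; the data and parameters are illustrative only.

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only)
X, _ = make_blobs(n_samples=50, centers=3, random_state=1)

# Agglomerative (bottom-up) clustering: repeatedly merge the closest pair of
# clusters until the requested number of clusters remains.
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Dendrogram of the merge sequence (cf. Fig. 4), built with SciPy's Ward linkage.
Z = linkage(X, method="ward")
dendrogram(Z)
plt.show()
```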

2.1.4 Density-based clustering

The density-based method relies on density as the local cluster criterion, grouping points that are density-connected. Density-based clustering can identify clusters of arbitrary shape and handles noise in the data effectively. It requires only a single scan of the data, examining the local region to validate the density, but it requires density parameters to be specified as a termination condition [22]. Density-based spatial clustering of applications with noise (DBSCAN) is a well-known example of a density-based method [23]. It labels high-density areas as clusters and points in low-density areas as outliers, which helps it discover clusters of varied shapes and deal with noise without requiring a preset number of clusters [23]. Figure 5 visualises the clusters produced by the DBSCAN method.

Fig. 5 DBSCAN Clustering
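
A minimal DBSCAN sketch with scikit-learn follows; the two half-moon shapes illustrate non-spherical clusters, and the eps/min_samples values are illustrative rather than the settings used in the study.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that DBSCAN handles well.
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps is the neighbourhood radius and min_samples the core-point threshold;
# together they define the density criterion.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Points that do not belong to any dense region are labelled -1 (noise).
n_noise = (labels == -1).sum()
```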

2.2 Performance comparison measures

The performance of various unsupervised ML methods is assessed using different evaluation techniques, such as Adjusted Rand Index (ARI), Adjusted Mutual Information (AMI), Homogeneity, Completeness, V-measure, and Silhouette Coefficient. These are applied to establish comparative performance metrics in this study.

2.2.1 Adjusted Rand Index

The Adjusted Rand Index (ARI) is a chance-corrected modification of the Rand Index. It calculates a similarity score between two clusterings by considering all sample pairs and counting those pairs that are assigned to the same or to different clusters in the predicted and actual clusterings [24]. The formula for ARI is

$$\mathrm{ARI} = \frac{\mathrm{RI} - \mathrm{Expected\, RI}}{\mathrm{Max(RI)} - \mathrm{Expected\, RI}}$$

where \(RI\) is the Rand Index, \(Expected \,RI\) is the Expected Rand Index, and \(Max\,(RI)\) is the maximum possible Rand Index.

The ARI value lies between -1 and 1, where 1 means identical clusterings and -1 means completely dissimilar clusterings. An ARI of 0 indicates random labelling.
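
For instance, using scikit-learn's implementation (the toy labellings below are purely illustrative):

```python
from sklearn.metrics import adjusted_rand_score

y_true = [0, 0, 1, 1]                             # ground-truth classes
print(adjusted_rand_score(y_true, [1, 1, 0, 0]))  # 1.0: identical partition (label names do not matter)
print(adjusted_rand_score(y_true, [0, 1, 0, 1]))  # negative: a discordant partition
```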

2.2.2 Adjusted Mutual Information

Adjusted Mutual Information (AMI) modifies the Mutual Information (MI) score to account for chance [25]. It addresses the fact that MI tends to increase as the number of clusters grows, independent of the actual amount of shared information between the clusterings. The formula for AMI is

$${\mathrm {AMI}(\mathrm {X,Y})}=\frac{\mathrm{MI}-\mathrm{Expected\, MI}}{\mathrm{Max}(\mathrm H(\mathrm X),\mathrm H(\mathrm Y))-\mathrm{Expected\, MI}}$$

where \(MI\) is the mutual information, \(H(X)\) and \(H(Y)\) are the entropies of X and Y, and \(Expected \, MI\) is the expected mutual information. AMI ranges from 0 to 1, where a score of 1 indicates perfect agreement between the two clusterings. A score close to 0 suggests largely independent clusterings or a result no better than random chance.
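
A small illustrative example with scikit-learn (toy labellings only):

```python
from sklearn.metrics import adjusted_mutual_info_score

y_true = [0, 0, 1, 1]
print(adjusted_mutual_info_score(y_true, [1, 1, 0, 0]))  # 1.0: perfect agreement
print(adjusted_mutual_info_score(y_true, [0, 1, 2, 3]))  # 0.0: singleton clusters carry no
                                                         # information once corrected for chance
```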

2.2.3 Homogeneity

Homogeneity is a clustering measure that compares the outcomes to a ground truth. A cluster is homogeneous if it is made up entirely of data points from a single class [26]. The formula for Homogeneity is

$$\mathrm{Homogeneity} = 1-\frac{H({Y}_{true}|{Y}_{predict})}{H({Y}_{true})}$$

where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{true}|{Y}_{predict})\) is the conditional entropy of the ground truth given the cluster predictions, and \(H({Y}_{true})\) is the entropy of the ground truth. Homogeneity also varies between 0 and 1. A score of 1 means that each cluster contains only members of a single class, signifying perfect homogeneity. A score of 0 indicates that the clusters are randomly assigned, lacking any homogeneity.
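
The following toy example (illustrative labels only) shows that homogeneity rewards pure clusters even if a class is split across several of them:

```python
from sklearn.metrics import homogeneity_score

y_true = [0, 0, 1, 1]
# Each predicted cluster contains a single class, so homogeneity is perfect
# even though class 1 is split across clusters 1 and 2.
print(homogeneity_score(y_true, [0, 0, 1, 2]))  # 1.0
# A single cluster mixing both classes has no homogeneity.
print(homogeneity_score(y_true, [0, 0, 0, 0]))  # 0.0
```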

2.2.4 Completeness

Completeness is another clustering evaluation metric, determining whether all data points belonging to a given class are assigned to the same cluster. The clustering result is deemed complete when each class is contained within a single cluster [26]. The formula for this measure is

$$\mathrm{Completeness} = 1-\frac{H({Y}_{predict}|{Y}_{true})}{H({Y}_{predict})}$$

where \({Y}_{true}\) is the ground truth and \({Y}_{predict}\) is the predicted clustering. \(H({Y}_{predict}|{Y}_{true})\) is the conditional entropy of the cluster predictions given the ground truth, and \(H({Y}_{predict})\) is the entropy of the cluster predictions. Completeness ranges from 0 to 1. A score of 1 is achieved when all members of each class are assigned to the same cluster, indicating complete capture of all classes within the clusters. A score of 0 implies that the clustering assignments are completely scattered, failing to capture the class structure.
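
A complementary toy example (illustrative labels only): completeness rewards keeping each class together, even inside an impure cluster:

```python
from sklearn.metrics import completeness_score

y_true = [0, 0, 1, 1]
# All members of each class end up in one cluster, so completeness is perfect,
# even though that cluster mixes the two classes.
print(completeness_score(y_true, [0, 0, 0, 0]))  # 1.0
# Splitting class 1 across two clusters lowers completeness (about 0.67 here).
print(completeness_score(y_true, [0, 0, 1, 2]))
```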

2.2.5 V-measure

The V-measure is the harmonic mean between Homogeneity and Completeness [26], and the formula is

$$\text{V-measure} = \frac{2 \times \mathrm{Homogeneity} \times \mathrm{Completeness}}{\mathrm{Homogeneity} + \mathrm{Completeness}}$$

The V-measure score lies between 0 and 1. A score of 1 represents perfect clustering, that is, perfectly complete and homogeneous labelling, with all classes fully captured within clusters and each cluster containing only members of a single class. A score of 0 indicates that the clustering fails on both homogeneity and completeness grounds.
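
Continuing the toy example above, the V-measure can be checked against the harmonic mean of the two component scores:

```python
from sklearn.metrics import homogeneity_completeness_v_measure

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]   # homogeneous (pure clusters) but not complete (class 1 is split)
h, c, v = homogeneity_completeness_v_measure(y_true, y_pred)
assert abs(v - 2 * h * c / (h + c)) < 1e-12   # V-measure is the harmonic mean
print(h, c, v)   # roughly 1.0, 0.67, 0.8
```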

2.2.6 Silhouette coefficient

The silhouette coefficient is used in cluster analysis to assess clustering quality. It computes the distance between each data point in one cluster and the points in neighbouring clusters, measuring how well each data point fits into its allocated cluster [27]. The formula is

$$\mathrm{Silhouette}= \frac{{b}_{i} - {a}_{i}}{\max({a}_{i}, {b}_{i})}$$

where \({a}_{i}\) is the average distance from point \(i\) to the other points in its own cluster, and \({b}_{i}\) is the average distance from point \(i\) to the points in the nearest neighbouring cluster. Silhouette Coefficient values range from -1 to 1. A score of 1 denotes that the clusters are well separated and clearly distinguished. A score of 0 indicates overlapping clusters, and a negative value suggests that data points may have been assigned to the wrong clusters. This metric gives a perspective on the separation and cohesion of the formed clusters.
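
Unlike the preceding measures, the Silhouette coefficient does not use ground-truth labels; a minimal sketch with synthetic data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Well-separated synthetic blobs (illustrative only)
X, _ = make_blobs(n_samples=100, centers=3, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# The score is computed from the feature matrix and the predicted labels alone.
print(silhouette_score(X, labels))  # close to 1 for well-separated clusters
```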

3 Research dataset

Table 1 presents the datasets utilised in this study, outlining their respective attributes, including the number of features and the data size. These datasets were sourced from the UCI Machine Learning Repository [28] and Kaggle [29]. This research uses the original data without further preprocessing, apart from dropping any entries with missing values, to ensure an unbiased comparison.
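
As a minimal sketch of this preparation step (the file name and the label column are hypothetical placeholders, not the actual dataset files):

```python
import pandas as pd

# Hypothetical example: load one benchmark dataset and drop entries with missing values.
df = pd.read_csv("heart_failure.csv")    # placeholder file name
df = df.dropna()                         # remove rows with any missing value
X = df.drop(columns=["target"]).values   # input features ("target" is a placeholder label column)
y = df["target"].values                  # ground-truth labels, used only for evaluation
```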

Table 1 A summary of the datasets

4 Results

We employed the default parameters provided by scikit-learn (Sklearn) for training our unsupervised ML models [44]. Tables 2, 3, 4, 5, 6 and 7 report the ARI, AMI, Homogeneity, Completeness, V-measure and Silhouette metrics for the various models trained on our research datasets.
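
The sketch below illustrates the general evaluation procedure described here; it is not the released code linked later in this section. The feature matrix X and label vector y are assumed from the data-preparation step, and the cluster counts are set to 2 purely for illustration, whereas the study reports using library defaults.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans, MiniBatchKMeans
from sklearn.mixture import BayesianGaussianMixture, GaussianMixture
from sklearn import metrics

models = {
    "k-means": KMeans(n_clusters=2),
    "Mini Batch k-means": MiniBatchKMeans(n_clusters=2),
    "GMM": GaussianMixture(n_components=2),
    "BGM": BayesianGaussianMixture(n_components=2),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "DBSCAN": DBSCAN(),
}

for name, model in models.items():
    labels = model.fit_predict(X)  # X, y: features and labels prepared in Sect. 3
    print(name,
          metrics.adjusted_rand_score(y, labels),
          metrics.adjusted_mutual_info_score(y, labels),
          metrics.homogeneity_score(y, labels),
          metrics.completeness_score(y, labels),
          metrics.v_measure_score(y, labels),
          # Silhouette needs at least two distinct clusters (DBSCAN may return only noise).
          metrics.silhouette_score(X, labels) if len(set(labels)) > 1 else float("nan"))
```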

Table 2 Adjusted Rand Index (ARI) comparison among unsupervised machine learning models
Table 3 Adjusted Mutual Information (AMI) comparison among unsupervised machine learning models
Table 4 Homogeneity comparison among unsupervised machine learning models
Table 5 Completeness comparison among unsupervised machine learning models
Table 6 V-measure comparison among unsupervised machine learning models
Table 7 Silhouette comparison among unsupervised machine learning models

Table 2 illustrates the ARI across the 15 datasets. The best-performing method was Divisive clustering on D12 (0.8510), followed by BGM on D5 (0.6413). Performance varied widely across methods and datasets, underlining the necessity of testing multiple techniques. For AMI, the highest-performing model changed by dataset (Table 3). For instance, the best performance was observed with Divisive clustering on dataset D12 (0.7504), followed by BGM on dataset D5 (0.5337). For Homogeneity (Table 4), DBSCAN performed remarkably well on eight datasets, while Divisive clustering performed best on D1 and D12. Regarding Completeness (Table 5), DBSCAN performed best on seven datasets, while BGM and Divisive clustering showed strong results on three datasets. For the V-measure metric (Table 6), DBSCAN showed the best performance on eight datasets. For the Silhouette score (Table 7), Agglomerative clustering performed best on four datasets.

Additionally, Table 8 illustrates how often each model scored the highest in any given measure. DBSCAN showed the best performance most often (31 times), followed by BGM (18), Divisive clustering (15) and Agglomerative clustering (14). For individual performance metrics, DBSCAN is the top performer regarding Homogeneity, Completeness and V-measure, while BGM did well against the ARI and AMI metrics. The k-means-based methods (classic and Mini Batch) showed the weakest performance.

Table 8 Comparison of unsupervised machine learning models showing the number of times they presented the highest measurement

The best model to choose will depend on the particulars of a specific application and the performance indicators that matter most to the stakeholders. Overall, the DBSCAN model received the highest number of top scores across the 15 datasets, demonstrating the best overall performance. However, DBSCAN is sensitive to parameter settings and may struggle with clusters of varying densities, whereas Divisive clustering does not rely on specific parameter settings and handles clusters with different densities better. Additionally, unlike Divisive clustering, DBSCAN can face challenges in high-dimensional spaces and in preserving the global structure of the data. A critical observation is the wide variance in model performance across different datasets, even though DBSCAN dominated in most cases. This likely reflects innate differences in data distribution, noise, and feature relevance. This variation underscores the need for a sophisticated and discerning approach to choosing the appropriate unsupervised ML model, carefully weighing the dataset's unique properties alongside each model's inherent advantages.

Furthermore, the fact that DBSCAN consistently exhibits high Homogeneity, Completeness and V-measure implies that this model is particularly well-suited for datasets where classes are separated by density. This insight could prove invaluable for practitioners dealing with such data characteristics. Conversely, the strong performance of BGM in the ARI and AMI metrics across various datasets indicates its potential as a versatile model capable of capturing the structure of the data with a reasonable balance between cluster purity and recovery.

The Python code used to implement the unsupervised machine learning models considered in this study is available at https://github.com/haohuilu/unsupervisedml/.

5 Discussion

This research compares unsupervised machine learning models applied to 15 different health-related datasets. The datasets were sourced from the UCI Machine Learning Repository and Kaggle and encompass a variety of health issues, including heart disease, diabetes, and multiple forms of cancer. These datasets exhibit diverse numbers of features and sizes. The primary goal of this study was to contrast the performance of these models across multiple measures without undertaking any data preparation, ensuring an unbiased comparison.

The performance of the models differed based on the dataset and the evaluation metrics employed: ARI, AMI, Homogeneity, Completeness, V-measure, and Silhouette. Each metric provides a distinct insight into clustering quality. ARI and AMI measure the clustering against the ground truth. Homogeneity evaluates if each cluster solely comprises members of a single class, while Completeness assesses if all members of a specific class are grouped into the same cluster. V-measure is a harmonic mean of these two, and Silhouette gauges cluster separation and cohesion. DBSCAN’s excellent performance in Homogeneity suggests its robustness in capturing dense clusters, but it also flags potential shortcomings in handling data with varying densities or noise levels.

Meanwhile, the BGM model ranks second in overall performance across 15 datasets. It shows notable strength in ARI scores for five datasets and AMI scores for four. However, its high computational requirements might limit its use in large datasets or those needing immediate analysis. BGM models excel at autonomously determining cluster numbers in complex datasets and resist overfitting by integrating prior distributions. However, their high computational demand and reduced effectiveness with non-Gaussian data or inappropriate priors are notable drawbacks [45]. The third best performing model, Divisive clustering, designed for sequencing datasets like life-course histories, leverages Classification and Regression Tree analysis principles, including tree pruning, to predict cluster counts. It excels in hierarchical, large datasets by uncovering complex relationships but can struggle with overlapping or non-hierarchical data, leading to less accurate clustering [46]. Moreover, the consistent performance of Agglomerative Clustering in terms of the Silhouette score suggests its potential utility in datasets where clear separation between clusters is present. Nevertheless, Mini Batch k-means offers an alternative that might better manage noise while sacrificing some degree of performance due to its inherent randomness.

The selection of models in unsupervised learning tasks is nuanced and contextual. For instance, while hierarchical methods like agglomerative and divisive clustering do not require the specification of the number of clusters, their computational intensity and potential to create unbalanced hierarchies must be considered, especially for large datasets. In the literature, the application of unsupervised machine learning models in disease prediction must be judicious, considering the unique characteristics of healthcare data. For example, k-means is known for its efficiency and has been widely used in medical data analysis for its simplicity [47]. However, its performance can be hindered by the requirement to specify the number of clusters and its sensitivity to outliers [48]. DBSCAN is favoured for its ability to find clusters of arbitrary shapes and sizes, which is often suitable for the complex patterns present in medical datasets [49]. Yet, its performance can degrade with varying density clusters. The Gaussian Mixture Model offers flexibility due to its probabilistic nature and can accommodate the varied distribution of medical data [20], though it can be computationally intensive, which may not be optimal for all applications. Experts agree that there is no one-size-fits-all model, and the choice should depend on the specific requirements of the data and the task at hand [50].

To sum up, while DBSCAN frequently emerged as the top performer, no singular model consistently outshone others across every dataset and metric. The choice of model should be influenced by the unique attributes of the dataset and the relevance of the evaluation metrics for the particular research or application context. This study serves as a valuable reference for future unsupervised learning endeavours in health-related fields. It also emphasises the importance of continued exploration in model selection and optimisation techniques. The basic principles, pros and cons of various unsupervised models are detailed in Table 9.

Table 9 Basic principles, Pros and Cons of different unsupervised machine learning models

6 Conclusion

This study comprehensively compared unsupervised learning models within the realm of disease prediction. The diversity of data types within this field, from heart disease to prostate cancer, demands a flexible approach to model selection. Based on the evaluated performance metrics, two models emerged as particularly promising: DBSCAN and BGM. The former demonstrated robust performance on the Homogeneity and V-measure metrics, underscoring DBSCAN’s aptitude for discerning densely populated clusters of similarity, even across heterogeneous datasets. Conversely, BGM excelled on the ARI and AMI metrics. Such findings highlight the potential prowess of these models in disease prediction. Their consistently high performance across diverse datasets indicates their capability to transcend the inherent challenges posed by the varied scales and ranges typical of medical data. Despite the intricate nature of medical datasets, these models succeeded in effectively clustering the data. The findings from this study serve not only as a testament to the capabilities of these models in overcoming the challenges posed by medical datasets but also as a caveat to users to be mindful of the models’ limitations. Future research could explore the application of deep learning models for predicting disease risks, drawing from an even broader pool of medical datasets. One of the most noteworthy attributes of unsupervised machine learning models is their flexible architecture, which facilitates adaptability and continuous enhancement. It is important to note that unsupervised learning is an evolving domain, and ongoing advancements in algorithm efficiency, model robustness and interpretability are expected to further enhance their application in disease prediction and other medical applications.