1 Introduction

In recent years, there has been growing interest in the application of advanced analytics and machine learning (ML) to healthcare data. This interest has been sparked by opportunities to analyse anonymised healthcare data and by advances in contemporary hardware and software technology [1]. The benefits of healthcare data analytics should not be underestimated. Access to electronic health records (EHR), population-based registries, disease registries and data from clinical trials can lead to knowledge discovery, provided that the clinical problem is well defined and the target dataset is analysed in collaboration with clinical teams. Domain knowledge and clinical experience enable researchers to identify knowledge gaps and formulate clinically important research questions that advanced analytics can address during the data science process. As in other fields, domain knowledge plays a role, to a varying degree, at every stage of a data science project [2].

2 Aims and objectives

The first aim of this paper is to explore how domain knowledge influences the data mining process during a clustering experiment. We chose clustering analysis because it is a popular unsupervised ML technique that has previously been used to define new ‘subgroups’ of patients with heart failure (HF) [3] [4] [5] [6]. Research has demonstrated that ML can improve the phenotyping of patients with HF, leading to the discovery of new clinical taxonomies [7] and to the design of clinical trials that test new treatment combinations not previously used in specific clusters of patients with HF [8].

Secondly, we attempt to address the gap that exists between academic data scientists and health practitioners by introducing a framework that enables the integration of domain knowledge into the clustering analysis process. The main motivation for closing this gap is to promote effective dialogue between domain experts and theoretical data scientists, which will generate mutual gains. The prospect of discovering new patterns that could lead to practical solutions for the medical industry should meet the needs of domain experts, whereas the opportunity to publish a detailed report of the data mining process and to translate the experimental work into industry may bring gains to academics. Interaction and effective dialogue between academia and the medical industry start with mutual respect for the contributions of both parties.

3 Background

Over a decade ago, Cao et al. (2006) [9] identified the differences between the data-driven and the domain-led data mining processes leading to knowledge discovery. The most striking differences in the approaches to data mining were seen between academia and business [10]. According to Cao et al. [10], traditional data-driven data mining focuses on developing innovative approaches with the algorithm at the centre. This approach lets the data drive research innovation and perhaps novel algorithms, whilst domain-driven data mining includes humans as a central part of the process and brings solutions to real-world business problems. Following this observation, the organisers and participants of the 2007 ACM SIGKDD International Workshop on Domain Driven Data Mining made a justified call for a paradigm shift from “interesting hidden pattern mining” to “actionable knowledge discovery in varying data mining domains” [11].

A similar pattern of discrepancies between the goals of data mining performed by technologists and by clinicians was presented by Jasinska et al. (2021) in an extensive systematic literature review of studies using ML on heart failure datasets [12]. The review observed a tendency to overclaim the usefulness and applicability of predictive models to real-world clinical problems, as well as an ‘unwritten’ race to achieve a higher AUC than that reported by previous authors. The review also provided examples of high-quality data science projects co-authored by domain experts and data scientists, resulting in sound predictive models and novel clinical phenotypes [8] [13] [14] [15] [16] [17] [18] [19]. There were studies in which authors affiliated purely with information technology engaged with clinicians (domain experts) at various stages of the data mining process [20] [21] [22]. In most of those studies, clinicians were involved either at the data extraction stage, when candidate features from the datasets were assessed for suitability for the prediction model, or at the stage of interpreting the results [22] [23] [19]. For example, Sun et al. (2012) [22] performed a series of interviews with cardiologists to search for clinically meaningful additional risk factors to be used in their predictive model. Saqlain et al. [20] asked cardiac specialists to confirm that the chosen features were sufficient to obtain valuable results from the model, and reported that this approach provided them with deep knowledge of cardiology and helped them to understand the problem domain.

However, one key result of this review was that roughly a quarter of the included papers (22 out of 81) were authored exclusively by researchers affiliated with computer science, IT, business administration, translational research, health informatics, bio-medical informatics, quantitative health sciences, statistics or a related area [12]. Moreover, none of those 22 papers mentioned the incorporation of domain knowledge in the design and execution of the data science project [12]. This is a concerning pattern: it illustrates that, in a purely data-driven approach to data mining, there was very limited (if any) integration of domain knowledge into the data science process.

Undoubtedly, in the era of big data and digital transformation, there is a greater need than ever for a paradigm shift from data-driven data mining to domain-driven knowledge discovery, particularly in the field of healthcare data.

4 Structure

In this paper, we explore a domain-led approach and a data-driven approach to performing k-means clustering, and we compare the insights derived from both approaches. In the domain-led approach, the variables (features) used for clustering were agreed and selected by clinical experts in heart failure management. These clinicians are co-authors of this paper (AJP, DM, PC, R. Brisk). The justification for the choice of features is described in the Methods section. In the second approach, referred to as the data-driven approach, the features for clustering were determined using a ‘pure’ data-driven feature extraction process, in which principal component analysis (PCA) was used to reduce the dimensionality of the dataset whilst maintaining a degree of variance in the dataset. In the Methods section, we describe the two approaches to the feature selection stage: in Sect. 5.3, we describe the methods used in the domain-led feature selection and clustering experiment, whilst in Sect. 5.4, we present the data-driven approach to feature extraction and clustering analysis. The Results section is divided into two subsections, one for each approach. In the Discussion section, we synthesise the results of both approaches, present the advantages and disadvantages of data-driven and domain-led clustering analysis, and provide flowcharts illustrating both approaches. Going beyond this synthesis, we propose a practical checklist that data scientists could use to ensure that domain knowledge is embedded in data mining projects focused on healthcare data.

5 Methods

5.1 Materials

For the purpose of this experiment, we chose an open access heart failure dataset available from the PhysioNet data repository and curated by Zhang et al. (2021) [24] [23]. This dataset was collected with the goal of developing a predictive model for classifying emergency readmission of patients with heart failure using data from electronic health records (EHR). The data were collected between 2016 and 2019 in the Sichuan Hospital in China. The dataset contains 2008 instances (i.e. patient cases) and 168 variables (i.e. demographic and clinical features) describing the characteristics of patients with HF. When identifying patients with HF, the curators of the dataset used the definition of HF given by the European Society of Cardiology (ESC) [25], i.e. the presence of symptoms and/or signs of HF together with (1) raised Brain Natriuretic Peptide (BNP) > 35 pg/mL or NT-proBNP > 125 pg/mL or (2) objective evidence of underlying functional or structural cardiac abnormalities evidenced by (3) stress test or (4) invasively measured elevated left ventricle (LV) filling pressure.

The data analysis in this paper was performed using MATLAB (version 2021b), with functions included in the Statistics and Machine Learning Toolbox [26].

5.2 Exploratory data analysis and data pre-processing

The dataset [23] [24] used for this experiment has several limitations. It does not offer time series data over the hospitalisation period. Whilst there are 2008 patients in the dataset with 167 variables, many values are missing for variables that are considered important from a clinical domain perspective. As shown in Table 1, which gives the percentage of missing values for selected features, only 32% of the patients included in this study have data on their left ventricular ejection fraction (LVEF). LVEF is an important characteristic that is measured by an echocardiogram (ECHO), an ultrasound scan of the heart. The severity of HF is defined by the range of the LVEF. Current international guidelines distinguish three types of HF: (1) HF with reduced ejection fraction (HFrEF), with LVEF ≤ 40%; (2) HF with mildly reduced ejection fraction (HFmrEF), with LVEF between 41% and 49%; and (3) HF with preserved ejection fraction (HFpEF), with LVEF ≥ 50% [27]. The LVEF range is used to categorise patients by type of HF and to prescribe appropriate HF treatment. LVEF cut-off points are also used in selecting patients to participate in clinical trials. For these reasons, we decided to perform the clustering analysis using only patient records with a known LVEF value, which was available for 635 patients (635 instances in the dataset).
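The categorisation and filtering step described above can be sketched in a few lines. The Python snippet below is only an illustration and is not the code used for the analysis, which was performed in MATLAB. The function name `hf_category`, the record layout and the example values are hypothetical, and the inclusive boundaries at 40% and 50% are an assumption based on our reading of the guideline ranges [27].

```python
from typing import Optional


def hf_category(lvef_percent: Optional[float]) -> Optional[str]:
    """Map an LVEF value (in %) to an HF type; return None if LVEF is missing."""
    if lvef_percent is None:
        return None
    if lvef_percent <= 40:
        return "HFrEF"    # reduced ejection fraction
    if lvef_percent < 50:
        return "HFmrEF"   # mildly reduced (values above 40 and below 50)
    return "HFpEF"        # preserved ejection fraction


# Hypothetical records: only those with a known LVEF are kept for clustering.
records = [
    {"id": 1, "lvef": 35.0},
    {"id": 2, "lvef": None},
    {"id": 3, "lvef": 59.0},
]
with_lvef = [r for r in records if r["lvef"] is not None]
categories = [hf_category(r["lvef"]) for r in with_lvef]
print(categories)
```

In this toy example the record without an LVEF value is dropped before categorisation, mirroring the restriction of the analysis to patients with a known LVEF.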

The dataset provides information about the re-hospitalisation due to HF and reports patient mortality. This dataset consists of 50 categorical and 117 numerical variables. Out of 117 numerical variables, only 68 variables had less than 10% missing values for pre-selected patients with known LVEF. Only numerical variables were used in the clustering analysis, because the k-means algorithm performs best on numerical data.

Table 1 Number of missing instances for each variable in the dataset, with percentages out of 2008 instances. LVEF: left ventricle ejection fraction; LVEDD: left ventricle end diastolic dimension; BNP: brain natriuretic peptide; GFR: glomerular filtration rate; CK: creatine kinase

To assess data distributions, we used visual and statistical methods. Each variable was plotted on a quantile–quantile plot (qqplot), which displays the quantiles of the sample data against the theoretical quantiles of a normal distribution [28]. On visual inspection, it was clear that the numerical variables in this dataset did not follow a normal distribution. In addition, we used a one-sample Kolmogorov–Smirnov test at a significance level of 0.05 to check whether the data were normally distributed [29]. The null hypothesis of normality was rejected for each variable.

Following the findings of the exploratory data analysis (EDA), in the pre-processing stage we addressed the missing values in the dataset, which amounted to less than 10% of the data points of each retained variable. Because the variables were not normally distributed, we used single imputation with the median value of each variable. We normalised the dataset by scaling all feature values to the range [0, 1]. We did not remove outliers from the dataset, as this could lead to the loss of important information about groupings in the data. The final dataset consisted of 635 instances with 68 variables.
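The two pre-processing steps, median imputation followed by scaling to the unit range, can be illustrated with a minimal Python sketch. It is not the MATLAB code used in the experiment; the helper names `impute_median` and `min_max_scale` and the sodium values are hypothetical, and missing values are assumed to be encoded as `None`.

```python
from statistics import median


def impute_median(column):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly so that the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical sodium values (mmol/L) with one missing entry.
sodium = [138.0, None, 129.0, 141.0, 135.0]
imputed = impute_median(sodium)
scaled = min_max_scale(imputed)
print(imputed)
print(scaled)
```

Because outliers are retained, the minimum and maximum used for scaling are the observed extremes of each variable, so a single extreme value compresses the remaining values towards one end of the range.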

5.3 Experiment 1: Domain-led approach to feature selection

The main difference between the domain-led approach and the data-driven approach lies in the feature selection stage. To agree on the variables to be passed into the k-means clustering algorithm, the clinical co-authors reviewed the variables and decided to use the following features: “brain natriuretic peptide (BNP)”, “haemoglobin”, “mean corpuscular volume” (MCV), “creatinine enzymatic method”, “sodium”, “albumin” and “left ventricle ejection fraction (LVEF)”. The decision to select these particular features was driven by clinical experience and by knowledge of the outcomes of previous randomised controlled trials (RCTs) and observational studies in HF [30] [31]. In the clinicians’ experience, these variables accurately characterise the severity of heart failure. Moreover, it has been shown that these features carry prognostic value in the course of HF with regard to diagnosis, prognosis, quality of life and hospitalisation.

From the available variables, the clinicians selected brain natriuretic peptide (BNP) because it is the current standard for the diagnosis and monitoring of HF. BNP levels correlate with the New York Heart Association (NYHA) classification of HF. BNP is a test of high specificity and sensitivity: BNP levels greater than 100 pg/mL have a specificity greater than 95% and a sensitivity greater than 98% when comparing patients without HF to all patients with HF [27]. This was a strong argument for choosing BNP as one of the features for the clustering experiment.

Haemoglobin was chosen as an indicator of anaemia in general and MCV as an indicator of iron deficiency anaemia. It is known that up to 50% of patients with HF suffer from iron deficiency [32]. By using the variables “haemoglobin” and “mean corpuscular volume”, we could potentially capture the severity of iron deficiency anaemia during the clustering experiment. Iron deficiency anaemia is evidenced by a low haemoglobin level, a low mean corpuscular volume of the red cell (MCV) and a low serum iron level. It is associated with a lower quality of life, reduced exercise tolerance and increased mortality in HF patients [27]. RCTs (IRONMAN – NCT02642562, AFFIRM-AHF – NCT02937454, FAIR-HF2 – NCT03036462, HEART-FID – NCT03037931, FAIR-HFpEF – NCT03074591) [32] and meta-analyses [33] [34] have demonstrated that intravenous iron supplementation in HF patients with iron deficiency improves symptoms, quality of life and exercise tolerance (as measured by peak VO2 and the 6-minute walk test (6MWT)), with an observed trend towards a reduction in hospitalisation rates.

Creatinine was chosen as an indicator of the possible presence of cardio-renal syndrome. During the natural history of HF, patients develop cardio-renal syndrome as a result of the poor renal perfusion secondary to the low cardiac output present in HF. Patients with severe HF go on to develop chronic kidney disease that gradually progresses to irreversible renal failure [35]. Cardio-renal syndrome has a negative impact on HF patients’ outcomes, and the stage of the kidney disease carries prognostic value [35].

Hyponatraemia (low sodium level) is a strong predictor of the severity of HF and is strongly correlated with increased mortality [30] [31]. Hypoalbuminaemia (low albumin level) is commonly a sign of cachexia (malnutrition), which is frequently described in patients with HF despite a normal or above-normal Body Mass Index (BMI) [36]. Of the variables obtained by ECHO, left ventricle ejection fraction (LVEF) was selected as the most representative feature to characterise the severity of heart failure. The justification for selecting LVEF is given in Sect. 5.2 on exploratory data analysis and data pre-processing.

In addition to the above reasons for selecting specific variables, their prior knowledge of the variables meant that the clinicians intuitively avoided choosing variables that are either ratios of other variables or highly correlated with other variables. For example, ‘INR’ is the International Normalised Ratio, which is derived from the prothrombin time (PT): it is calculated as the ratio of the patient’s PT to a control PT, standardised for the potency of the thromboplastin reagent. This formula was developed by the World Health Organization [37]:

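$$\begin{aligned} \mathrm{INR} = \left( \frac{\mathrm{PT}_{\mathrm{patient}}}{\mathrm{PT}_{\mathrm{control}}} \right)^{\mathrm{ISI}} \end{aligned}$$

where ISI is the International Sensitivity Index, which accounts for the potency of the thromboplastin reagent.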

Haematocrit is another example of a ratio calculated from the full blood count: it is the ratio of the volume of red blood cells to the total volume of blood.

Some of those variables (though not all at once, or in one study) have commonly been chosen in previous studies using machine learning to perform clustering or classification tasks [3] [4] [5] [6]. Ahmad et al. (2018), for example, used eight variables with k-means clustering: age, creatinine, haemoglobin, weight, heart rate, systolic blood pressure, mean arterial pressure and income [8].

5.4 Experiment 2: data-driven approach to feature selection

High-quality clustering produces clusters that are typically characterised by high within-cluster similarity and high between-cluster dissimilarity. This objective is particularly difficult to achieve when clustering is applied to a high-dimensional dataset. In the data-driven approach, principal component analysis (PCA) was used to reduce the dimensionality of the dataset whilst maintaining a high degree of the variance in the dataset. The final dataset (635-by-68) was passed through a PCA function, which used the singular value decomposition (SVD) algorithm based on the variance-covariance matrix. The algorithm centred the n-by-p data matrix X by subtracting the column means before computing the SVD. PCA used all of the observations in the n-by-p matrix (635-by-68) and returned all 68 principal components (PCs). In the Results section, we present the scree plot that shows the variance explained by each PC, together with the loadings that were considered the most important for each principal component.
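In standard notation, this computation can be summarised as follows. This is a textbook formulation given for orientation only, not a description of the internals of the PCA function that was used, and the symbols are introduced here purely for illustration. X_c denotes the column-centred data matrix, the columns of V hold the loadings (coefficients), σ_j are the singular values, λ_j is the variance along the j-th PC, and T contains the PC scores, with n = 635 and p = 68 in this experiment.

$$\begin{aligned} X_c&= X - \mathbf{1}_n \bar{x}^{\top }, \qquad X_c = U \Sigma V^{\top }, \\ \lambda _j&= \frac{\sigma _j^2}{n-1}, \qquad \mathrm{explained}_j = \frac{\lambda _j}{\sum _{k=1}^{p} \lambda _k}, \qquad T = X_c V \end{aligned}$$

The quantity explained_j is the share of the total variance attributed to the j-th PC, which is the value plotted for each bar in a scree plot.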

In the next step, we needed to decide how many PCs to pass to the k-means clustering algorithm. We decided to use the first 22 PCs, a choice informed by the method described by Tabachnick et al. [38]. They suggest that, when dealing with moderately sized data and a large number of variables, the number of PCs to retain for further analysis can be decided by a simple calculation based on the number of variables: the analyst chooses the number of PCs from the range (p/5, p/3), i.e. between the number of variables divided by 5 and divided by 3, where p is the number of variables in the n-by-p data matrix. Tabachnick et al. also describe a visual method of deciding upon the number of PCs to be considered; however, this method may not be accurate and might be prone to variability due to the subjectivity of the plot assessment [38].
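For p = 68 the range can be worked out directly. The short Python sketch below is illustrative only; the function name `pc_count_range` is hypothetical, and rounding the bounds inwards to whole numbers of PCs is our assumption about how the rule is applied.

```python
import math


def pc_count_range(p):
    """Smallest and largest whole number of PCs lying between p/5 and p/3."""
    lower = math.ceil(p / 5)    # round the lower bound up
    upper = math.floor(p / 3)   # round the upper bound down
    return lower, upper


# 68 numerical variables were used in this experiment.
lower, upper = pc_count_range(68)
print(lower, upper)
```

Under this reading, the 22 PCs retained in the experiment sit at the upper end of the suggested range.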

Holland [39] proposed another method for selecting the principal components, based on the assumption that all variables contribute the same variance to the PCs. In this case, it is recommended to select all PCs whose share of the explained variance is equal to or greater than 1/p, where p is the number of variables in the dataset. In our experiment, the cut-off point is 1/68, i.e. 1.47% of the total variance. Figure 1 shows all 68 PCs with a horizontal line marking the cut-off point used to accept the first 22 PCs. Beyond this point, the remaining PCs do not explain enough of the variance, and hence they were not taken to the next stage of the experiment.
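The 1/p rule amounts to a simple threshold on the explained-variance shares. The following Python sketch is a hedged illustration rather than the code used in the experiment; the function name `select_by_average_variance` is hypothetical and the five explained-variance shares are made-up values for a toy example.

```python
def select_by_average_variance(explained):
    """Keep the PCs whose share of the total variance is at least 1/p.

    `explained` holds the proportion of variance explained by each PC
    (one entry per PC, summing to 1), so p = len(explained).
    """
    p = len(explained)
    threshold = 1.0 / p
    kept = [i + 1 for i, share in enumerate(explained) if share >= threshold]
    return threshold, kept


# Made-up explained-variance shares for a toy dataset with five variables.
toy = [0.45, 0.25, 0.21, 0.06, 0.03]
threshold, kept = select_by_average_variance(toy)
print(threshold, kept)

# With p = 68 the threshold is 1/68, roughly 0.0147 (1.47% of the variance).
cutoff_68 = 1.0 / 68
```

The threshold corresponds to the variance that each PC would explain if the total variance were spread evenly across all p components.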

Fig. 1 Scree plot in which each PC is represented as a bar, in descending order of the percentage of the total variance explained by each principal component. The PCs were obtained from the dataset of 68 numerical variables for 635 patients. The horizontal line marks the cut-off point of 1/68 (1.47%)

Fig. 2 Biplot of the 26 variables with the highest loadings (loading > 0.32) contributing to the first 22 principal components

To further investigate the first 22 PCs, we used the loadings from the coefficient matrix to identify variables with loadings greater than 0.32. The decision to explore variables with loadings greater than 0.32 was supported by the rule of thumb described by Tabachnick et al. [38], who state that variables with loadings greater than 0.32 contribute the most to a given PC. Using this rule of thumb, we identified 26 variables that had the highest contribution to the first 22 principal components. Figure 2 presents a biplot showing these 26 variables.
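The selection of high-loading variables can be sketched as a filter over the coefficient matrix. The Python snippet below is illustrative and is not the code used in the experiment. The function name `high_loading_variables`, the variable names and the loading values are hypothetical, and comparing the absolute value of each loading with the cut-off is our assumption, since negative loadings can be as influential as positive ones.

```python
def high_loading_variables(coeff, names, n_pcs, cutoff=0.32):
    """Variables whose absolute loading exceeds `cutoff` on any of the first `n_pcs` PCs.

    `coeff` is a list of rows, one per variable; column j holds the loadings on PC j+1.
    """
    selected = []
    for name, row in zip(names, coeff):
        if any(abs(value) > cutoff for value in row[:n_pcs]):
            selected.append(name)
    return selected


# Toy coefficient matrix for three variables and three PCs (made-up numbers).
names = ["LVEF", "sodium", "urea"]
coeff = [
    [0.80, -0.10, 0.05],
    [0.20, -0.45, 0.30],
    [0.10, 0.15, 0.90],
]
selected = high_loading_variables(coeff, names, n_pcs=2)
print(selected)
```

In the toy example the third variable loads strongly only on the third PC, so it is not selected when the first two PCs are inspected.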

In the next stage, we passed the 22 PCs into the k-means clustering algorithm. The chosen PCs retained 81% of the variance in the dataset whilst reducing the dimensionality from the 68 original features to 22 new features represented by PCs.

5.5 k-means clustering: optimal number of clusters

To identify the optimal number of clusters for k-means clustering in both approaches, we used a visualisation technique known as the ‘elbow method’ (Figs. 3 and 4). We decided to divide the dataset into four clusters in both approaches. To assess the inter-cluster separation, we also used the silhouette criterion and silhouette graph for both methods [40], which confirmed that four was the optimal number of good-quality clusters.
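The quantity usually plotted in an elbow graph is the within-cluster sum of squares (WCSS) for increasing values of k. The Python sketch below computes it with a plain implementation of Lloyd's k-means algorithm. It is a simplified illustration under stated assumptions, not the MATLAB routine used in the experiment: it uses a single random initialisation drawn from the data points, and the names `squared_distance` and `kmeans_wcss` as well as the toy data are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_wcss(points, k, seed=0, max_iter=100):
    """Run Lloyd's k-means once and return the within-cluster sum of squares."""
    rng = random.Random(seed)
    # Initial centroids: k data points drawn without replacement.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments unchanged: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))


# Toy data: three loose groups of two-dimensional, already scaled points.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
    (0.0, 1.0), (0.1, 1.0), (0.0, 0.9),
]
wcss_by_k = {k: kmeans_wcss(points, k) for k in range(1, 6)}
for k, wcss in wcss_by_k.items():
    print(k, round(wcss, 4))
```

The ‘elbow’ is the value of k beyond which the WCSS stops falling steeply. Because a single initialisation can end in a poor local optimum, several restarts per value of k would normally be used before reading the plot.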

Fig. 3 Data-driven approach. The graph shows the “elbow method” used to choose the number of clusters when the k-means clustering algorithm is applied to the 635-by-22 matrix (the 22 columns represent the first 22 PCs). The vertical purple line identifies the optimal number of clusters

Fig. 4 Domain-led approach. The graph shows the “elbow method” used to choose the optimal number of clusters when the k-means clustering algorithm is applied to the 635-by-7 matrix (the 7 columns represent the variables chosen by domain experts for the domain-led clustering experiment). The vertical purple line identifies the optimal number of clusters

6 Results

6.1 Results of experiment 1: domain-led approach

Table 2 Domain-led approach. Median value (Minimum - Maximum) of 7 variables selected by domain experts to be used in the k-means clustering of HF cohort
Table 3 Domain-led approach, characteristics of clusters, prevalence of clinical conditions in the cluster

Figure 5 shows a summary of the characteristics of each cluster derived using the domain-led approach. Table 2 provides the median value of each of the seven variables used in the clustering experiment. Table 3 provides a summary of the comorbidities observed in each cluster. Using the domain-led approach, we identified the following clusters.

Cluster 1 had the second most impaired heart function, with a median BNP of 1591 and a median LVEF of 39% (range 17–49%). This cluster was similar to Cluster 2 in terms of the prevalence of kidney disease and had the second lowest prevalence of lung disease.

Cluster 2 was the largest cluster and had the least impaired heart function, as defined by the highest median LVEF (59%) and the lowest median BNP level (308). This cluster had the highest albumin level and the lowest creatinine level, indicating overall the least impaired heart function and the least impaired kidney function. In terms of comorbidities, Cluster 2 had the lowest prevalence of myocardial infarction and chronic kidney disease (15%).

Cluster 3 had the most severely impaired heart function, with the highest BNP (4486) and the lowest LVEF range, 43% (5–65%). Cluster 3 had the highest prevalence of lung disease, liver disease and dementia.

Cluster 4 had the highest prevalence of diabetes and chronic kidney disease. This cluster also had the highest prevalence of cerebrovascular disease (stroke) (7.6%).

Fig. 5 Domain-led approach. Summary of the most distinctive features of each cluster

Fig. 6 Data-driven approach. Summary of the most distinctive features of each cluster

6.2 Results of experiment 2: data-driven approach

In Tables 4, 5 and 6 and in Fig. 6, we present the most distinctive features that characterise each of the clusters. In the supplementary material, we provide an additional table (Table 9) that breaks down the remaining characteristics of all four clusters resulting from the data-driven clustering approach.

Table 4 Data-driven approach. This table shows the prevalence of medical history documented for patients in each cluster
Table 5 Characteristics of clusters from the data-driven approach. Top 26 variables contributing the most to the first 22 PC passed through the k-means algorithm. Median (Minimum - Maximum) value provided for each cluster
Table 6 Data-driven Approach. This table shows demographic characteristics including age and gender as well as reported symptoms according to NYHA class (New York Heart Association Functional Classification) and Killip classification

Cluster 1 has the highest values of MCV, MCHC, mean PLT volume, eosinophil count and eosinophil ratio. This cluster has the highest median values of albumin, white globulin, sodium, prothrombin activity and CO2-binding capacity. In terms of BNP and LVEF, Cluster 1 has the lowest BNP level and the second highest LVEF. Among the 42 variables that were used for clustering but contributed less to the first 22 PCs passed to the k-means clustering, Cluster 1 has the highest systolic and diastolic BP, resulting in the highest mean arterial pressure (MAP). This cluster has the highest BMI, lymphocyte count, monocyte count, basophil count and basophil ratio, and the lowest platelet count with the highest platelet distribution width (PLT-DW) and the highest chloride. This cluster has the highest glomerular filtration rate (GFR), with the lowest creatinine (68), urea (6.2) and uric acid (386), as well as the lowest ALP, direct bilirubin and globulin. Cluster 1 has the lowest INR and prothrombin time ratio, together with the lowest high-sensitivity troponin. In summary, this cluster has the best kidney function and the best heart function when compared to the other clusters.

Cluster 2 has the lowest prothrombin activity and the lowest CO2-binding capacity. In terms of BNP and LVEF, Cluster 2 has the highest BNP (3435) and the lowest LVEF (40%). Cluster 2 has the lowest systolic and diastolic BP, with the lowest MAP. This cluster has the highest left ventricle end diastolic dimension (LVEDD, 62 mm). This cluster has the highest urea and uric acid levels, with the second highest creatinine level. This cluster has the highest D-dimer level, INR (1.36), APTT, prothrombin time and PT ratio, with the lowest fibrinogen level. In terms of liver enzymes, Cluster 2 has the highest GGT (59), ALT (34), total bilirubin (27.2), indirect bilirubin (15.6) and direct bilirubin (10.3). In summary, Cluster 2 has the worst heart function, with the highest BNP and LVEDD and the lowest LVEF, and the poorest liver function.

Cluster 3 has the lowest albumin (33) and the lowest total cholesterol, LDL and white globulin levels. This cluster has the lowest haemoglobin (79) and the lowest MCV (86). In terms of BNP and LVEF, Cluster 3 has the second lowest BNP and the highest LVEF (54%). Cluster 3 has the highest creatinine level with the lowest GFR. This cluster has the lowest red blood cell count, haematocrit, monocyte count and lymphocyte count, and the lowest PLT haematocrit and PLT-DW. This cluster has the lowest GGT, ALT, total bilirubin, total protein, triglycerides and HDL. In summary, this cluster comprises patients who have the worst kidney function and severe anaemia, evidenced by low haemoglobin and MCV, with some stigmata of malnutrition, evidenced by the lowest total protein, cholesterol and HDL.

Cluster 4 has the highest haemoglobin, white blood cell count, neutrophil count, total Hb volume, total cholesterol and LDL. In terms of BNP and LVEF, Cluster 4 has the second highest BNP (667) and the second lowest LVEF (52%). Cluster 4 has the highest red blood cell count, with the highest monocyte count and haematocrit. This cluster has the highest APTT and thrombin time, with the highest fibrinogen. This cluster has the highest globulin level, total protein, triglycerides and HDL. This cluster has mostly moderate values for LVEDD, creatinine, urea, uric acid and GFR. In summary, this cluster has good heart function with normal kidney function.

7 Discussion

This experiment demonstrates that domain knowledge significantly reduces the dimensionality of the feature set and plays an important role in the interpretation of the clustering results.

Our aim was to explore how domain knowledge influences the stages of a data science project and how it can help to solve challenges posed by domain-specific issues. Moreover, through this experiment we wanted to draw attention to the need for the active involvement of domain experts in the data mining process. The goal of this experiment was to address some of the gaps identified in the domain-led data mining process [10].

7.1 Challenges of working with healthcare data

The healthcare sector produces one third of the globally stored digital data; hence, it seems obvious that clinical experts need to be involved in key stages of data science projects to unlock new insights and to integrate layers of clinical knowledge [41]. In the case of the healthcare sector, governments have recognised the need to train the clinical workforce in data analytics to help improve the integration of domain knowledge into the data science process and to prepare clinical teams to embrace the opportunities arising from digital transformation [42]. There is an expectation that clinical teams will become skilled in data analytics beyond their existing knowledge of biostatistics, which is already a curriculum requirement in medical schools and postgraduate training programmes [42] [43]. The involvement of domain experts is expected at various stages of the data science project: (1) the problem definition; (2) proposing and curating a target dataset; (3) data cleaning, pre-processing and data transformation; (4) feature set and algorithm selection; and (5) the evaluation and interpretation of learned knowledge, together with suggestions for the practical use of the new knowledge to improve processes within the specialist domain.

Skills in the practical application of advanced analytics including artificial intelligence (AI) and machine learning (ML) together with the ability to critically appraise the results will enhance knowledge discovery and provide innovations for the healthcare sector [12]. Education and training in AI and ML will increase the uptake of modern technologies that still suffer from the ‘black box’ stigma. Explainable ML methods must be understandable to the end users, i.e. the clinicians; moreover, clinicians and analysts must use a common language and have a comparable set of analytical skills.

In this experiment, a particular challenge was posed by high dimensionality of the dataset, which is not an uncommon challenge when dealing with healthcare data [44] [45]. We dealt with this issue by using domain knowledge to ‘hand pick’ features to be passed through the clustering algorithm. The advantage of the feature selection performed by domain experts as opposed to feature extraction enabled by algorithms like PCA is the immediate interpretability of the former [46]. In order to assure that the data-driven method could be used by other clinical teams, we aimed to improve the explainability of the data-driven approach to physicians. It was important to present how the PCA algorithm operates and to indicate which variables contributed the most to the PCs passed through the k-means algorithm. When we analysed the make up of the top 22 PCs and once the variables contributing the most were identified, we found out that only six out of seven variables chosen by the physicians were among the 26 variables carrying the highest value in top 22 PCs (LVEF, MCV, haemoglobin, sodium, BNP and albumin). Interestingly, “creatinine enzymatic method” was not amongst variables contributing the most to the top 22 PCs, even though it is felt to be an important variable from a clinical perspective. One possible reason why creatinine did not appear in the top contributing features could be the fact that PCs are aggregates of correlated variables with high variance, whereas in this dataset “creatinine enzymatic method” did not show either high correlation with other variables nor high variance. On a closer look at the reminder of top 26 contributing variables, we noted that total white cell count, eosinophil count, neutrophil count, as well as other variables obtained during the analysis of the full blood count were among these 26 variables. Those indices are produced by the haematology analyser and in most cases they come from a single blood sample. 
Physicians would be aware that those indices are usually highly correlated; hence, it is not a surprise to see those variables in the top PCs, as all components in PCA represent an aggregation of correlated variables. The data-driven approach identified 4 clusters and was effective in identifying smaller clusters with strongly distinctive features in terms of comorbidity and underlying physiology. The domain-led approach also identified 4 clusters; however, on a closer look, these clusters had a similar prevalence of comorbidities. To date, we could not find in the literature a standard measure for evaluating clustering, or standard measures for evaluating unsupervised feature selection methods for clustering [47]. There are, however, some commonly used internal and external measures for assessing the quality of clusters generated by a clustering algorithm [40]. A clustering solution can be assessed externally based on how closely it resembles a set of classes, commonly known as ground truth or ‘expert classification’ [40]. This ‘expert classification’ is simply the manual tagging of records with class labels by human experts. As shown in Figure 7, domain knowledge contributed significantly to all stages of the clustering experiment.
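
The inspection of the make-up of the PCs described above can be illustrated with a minimal sketch. It is not the code used in this study: the data are synthetic, the variable names are hypothetical, and ranking variables by their largest absolute loading across the leading PCs is only our assumption about how “contributing the most” might be operationalised.

```python
import numpy as np


def top_contributing_variables(X, names, n_components, top_k):
    """Rank variables by their largest absolute loading across the leading PCs."""
    # Standardise each column so variables on different scales are comparable.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = np.abs(Vt[:n_components])   # shape: (n_components, n_variables)
    score = loadings.max(axis=0)           # strongest contribution per variable
    order = np.argsort(score)[::-1][:top_k]
    return [names[i] for i in order]


# Synthetic stand-in data (NOT the study dataset): two strongly correlated
# columns, mimicking indices from one blood sample, plus one independent column.
rng = np.random.default_rng(0)
base = rng.normal(size=500)
X = np.column_stack([
    base + 0.1 * rng.normal(size=500),   # hypothetical variable "wbc"
    base + 0.1 * rng.normal(size=500),   # hypothetical variable "neutrophils"
    rng.normal(size=500),                # hypothetical variable "creatinine"
])
names = ["wbc", "neutrophils", "creatinine"]

top = top_contributing_variables(X, names, n_components=1, top_k=2)
print(top)
```

In this toy setting, the two correlated columns dominate the first component while the independent column does not, which mirrors the explanation offered above for why a clinically important but weakly correlated variable can be absent from the top PCs.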

Fig. 7

This figure shows the learning processes used during the clustering experiment. The branches on the left (shown in yellow) present stages unique to the data-driven approach, whereas the branches on the right (shown in blue) present stages unique to the domain-led approach.
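
The external assessment of a clustering solution against an ‘expert classification’, mentioned earlier in this section, can be sketched as follows. Purity is used here purely as one common example of an external measure; the study does not state that it was applied, and the labels below are hypothetical.

```python
from collections import Counter


def purity(cluster_labels, expert_labels):
    """Share of records that belong to the majority expert class of their cluster."""
    by_cluster = {}
    for c, e in zip(cluster_labels, expert_labels):
        by_cluster.setdefault(c, []).append(e)
    # For each cluster, count how many members carry its most frequent expert label.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(cluster_labels)


# Hypothetical toy labels, for illustration only.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
expert = ["A", "A", "B", "B", "B", "B", "C", "A"]
print(purity(clusters, expert))
```

A purity of 1.0 would mean that every cluster contains records from a single expert class; lower values indicate clusters that mix classes.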

7.1.1 Interpretability of the results

This paper shows that domain knowledge plays an important role in the analysis of results obtained through clustering. Knowledge of the physiology, pathology and correlations between variables allowed domain experts to reduce the number of variables used in the clustering experiment from 68 to 7. This approach was at risk of bias, though, as domain experts opted for the variables most commonly used to describe the stage of HF in day-to-day clinical work. Those variables are well known as they carry prognostic value based on observational studies and RCTs. Such variables have been used for many years in clinical practice and it seemed justified to choose them to best describe clusters of HF patients. The choice appeared straightforward because the variables used to describe HF patients are akin to a distinct language or code that is both universal and understood by clinicians. We approached the interpretation of the clustering results with similar ease. Knowing the normal range for all 68 variables by heart, it was a straightforward task to describe clusters of patients and draw conclusions regarding underlying pathological processes. For example, Cluster 3 from the data-driven approach had the lowest median haemoglobin and the lowest MCV, signifying iron deficiency anaemia, together with features of malnutrition, namely the lowest total protein and albumin levels. It was not a surprise that this cluster had the highest prevalence of peptic ulcer disease (11% of Cluster 3) and the highest presence of solid tumours (6% of Cluster 3).
As clinical domain experts, we would not make the mistake of labelling this cluster of HF patients as “anaemic and malnourished”. Because we are able to contextualise the information provided, we know that peptic ulcer disease in itself, and especially the presence of malignancy (hidden here under the term ‘solid tumour’), can cause iron deficiency anaemia and can lead to cancer-related malnutrition. What is most important, though, is that we remain aware that ‘correlation is not causation’, and our interpretation may be completely wrong either way. Another important aspect that requires attention when working with datasets of a selected population sample is the risk of bias that could be introduced into the dataset. It is well known to physicians that the prevalence of peptic ulcer disease is significantly higher in South Pacific populations compared to Western populations. Data from the Systematic Investigation of Gastrointestinal Disease in China showed that the prevalence of peptic ulcer disease was substantially higher in the Shanghai population (17.2%) than in Western populations (4.1%) [48]. Again, domain knowledge proves critical in preventing analysts from drawing conclusions from an unbalanced dataset or a dataset that represents disease prevalence unique to the population in a certain geographic area.
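
The kind of cluster profile that clinicians interpret in this section (median laboratory values and comorbidity prevalence per cluster) could be produced along the following lines. This is an illustrative sketch only: the record layout, the field names `cluster`, `haemoglobin` and `peptic_ulcer`, and all values are hypothetical and are not taken from the study dataset.

```python
from statistics import median


def profile_clusters(rows, value_key, flag_key):
    """Per-cluster size, median of a lab value and prevalence (%) of a comorbidity."""
    groups = {}
    for row in rows:
        groups.setdefault(row["cluster"], []).append(row)
    summary = {}
    for cluster, members in sorted(groups.items()):
        summary[cluster] = {
            "n": len(members),
            "median": median(m[value_key] for m in members),
            # Comorbidity flags are coded 0/1, so their sum is the number affected.
            "prevalence_pct": 100 * sum(m[flag_key] for m in members) / len(members),
        }
    return summary


# Hypothetical records, for illustration only.
rows = [
    {"cluster": 1, "haemoglobin": 130, "peptic_ulcer": 0},
    {"cluster": 1, "haemoglobin": 140, "peptic_ulcer": 0},
    {"cluster": 3, "haemoglobin": 95, "peptic_ulcer": 1},
    {"cluster": 3, "haemoglobin": 105, "peptic_ulcer": 0},
]
summary = profile_clusters(rows, "haemoglobin", "peptic_ulcer")
for cluster, stats in summary.items():
    print(cluster, stats)
```

Such a table only describes the clusters; as argued above, deciding what the pattern means clinically, and whether it reflects a population-specific prevalence, remains a task for domain experts.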

7.1.2 “Actionability” of the results

In the previous section, we discussed the significance of the interpretability of the clustering results. Interpretability is an excellent advantage of the domain-driven approach that risks, however, being lost or skewed in the data-driven approach. Even though the perceived objectivity of the data-driven approach may make it tempting to prefer it over the domain-led approach, it is important to emphasise the “actionability” of the clustering results, which is strongly linked with interpretability. “Actionability” is a natural byproduct of interpretability, and the two should go “hand in hand” during the data mining process. The Domain Driven in Depth Pattern Discovery (DDID-PD) framework proposed by Cao et al., in addition to providing directions on how domain knowledge should be put on top of the data-driven data mining framework, emphasises how the actionability of data mining can be enhanced. They use the terms technical interestingness and business (domain) interestingness to illustrate the process by which actionable knowledge can be discovered. The authors of the DDID-PD framework provide, in the form of a mathematical equation, a literal prescription for successful domain-driven data mining, exemplified by cases of mining actionable correlations in the stock market. Following this framework, an actionable pattern can be discovered when two conditions are met: technical interestingness and business interestingness. The DDID-PD framework captures the essence of successful domain-driven data mining and is certainly general enough to be applied to other sectors. We see the applicability of the DDID-PD framework to the healthcare sector and healthcare data. In terms of the actionability of clustering results, we would like to suggest the following 3 levels of actionability:

  1. Low level action is associated with the discovered taxonomy and labels for the clusters. For many years, taxonomies and classification methods have played a significant role in science and provided frameworks to present knowledge. In practical terms, the use of labels for representing the different types of patients (clusters) could allow new ways for monitoring temporal changes of these clusters/cohorts (in surveillance/epidemiology), for example monitoring the size of those cohorts or other characteristics and be an indicator of the population characteristics of a certain healthcare facility or region.

  2. Intermediate level of action could be associated with designing new clinical research protocols looking at specific cohorts of patients derived from data by clustering experiments. Groups of patients with specific features could be studied with respect to the cause of the pathology and potential new treatments.

  3. Significant level of action can be implemented by re-designing the healthcare services to enhance the detection of the health condition in specific clusters of patients. Tracking the quality of care, impact on quality of life, comorbidities and mortality statistics of specific cohorts of patients, with frequently occurring health problems and with specific health needs could be used for clinical auditing purposes, as an evidence for the quality improvement interventions, clinical pathways streaming and service re-design.

Table 7 Proposed checklist enabling the integration of domain knowledge into the data mining project. Continuation of the checklist is provided in Table 8
Table 8 Continuation of Table 7: proposed checklist enabling the integration of domain knowledge into the data mining project

7.2 Importance of the “Domain Knowledge”

Whilst this experiment did not provide any groundbreaking knowledge about HF itself, it is a useful case study demonstrating how domain knowledge can help navigate analysts through a healthcare data mining project. As far back as 2002, Kopanas et al. concluded that “in terms of the actors involved in the data mining process, domain experts should be in prominent positions within data analysis, data mining, data warehousing and data processing and should actively participate in and guide the process” [2]. Based on available publications [12] and the voices of data science experts from industry [49], the importance of the first pillar of the Cross Industry Standard Process for Data Mining (CRISP-DM) framework, “Business Understanding”, seems to be undervalued.

As a learning point from this experiment, we would like to propose a practical checklist to enhance the engagement of domain experts and the application of the domain knowledge in the data mining project related to healthcare data. It is important within the healthcare industry that analysts have an adequate understanding of the “domain” and that the domain experts (clinicians) help to navigate the direction of the analysis and point towards questions relevant from the clinical perspective.

In Tables 7 and 8, we propose a set of questions that the data scientist should be prepared to address prior to the initiation of the data mining project. This checklist is the result of collaboration within our team of clinicians and data scientists. We realised that the opportunities brought by advances in the computational abilities of current software and hardware pose a great temptation to use new machine learning (ML) techniques on healthcare data, especially data available in the public domain. Exploiting new ML techniques on healthcare data may be more effective when performed with the involvement of healthcare experts. Analogous to the trend of co-designing clinical studies with the involvement of public and patient representatives, it would seem natural to talk about the co-design and then the co-production of domain-driven data mining. We hope that this practical checklist for data scientists will enable better integration of domain knowledge into the data mining project. In addition to the set of questions in the checklist, we provide an example of how our team integrated domain knowledge during the clustering experiment while working on an open-source heart failure dataset.

Starting with question 1 of the checklist presented in Tables 7 and 8, it is useful to develop a partnership with clinicians early in any project that works on healthcare data. With regard to questions 2–4, it is important to understand the exact domain problem to be investigated, and to know whether the data science project is part of a research project, Quality Improvement Project (QIP), clinical audit or service evaluation project. If the project is part of research, it is useful to know the commonly occurring questions in health research. They can be grouped into 6 main themes: (1) characterising diseases and describing their natural course, (2) investigating the impact of a disease on the general population as well as assessing correlations between diseases, (3) finding the cause of disease, (4) discovering new treatments or the best combinations of existing treatments, (5) assessing how to deliver treatment to achieve the best result for patients, (6) learning about health systems and the costs associated with disease management.

With regard to question 5, domain experts provide specific knowledge of the subject and will know aspects related to the data itself. It is important that the data scientists leading the project take the opportunity to find out from domain experts (1) how the data was collected (e.g. was the data manually entered into a database by clinical staff, or was it recorded by monitoring devices and automatically saved to patients’ electronic records), (2) what the data values mean and what the normal value range is (e.g. does a low value of the variable indicate a normal state or severe pathology; or, in the case of time series data, does a long history of a certain condition have an impact on the long-term outcome for the patient, and could it influence the accuracy of a predictive model if that was the objective of the data mining project), (3) the accuracy of the data (is there a risk of error in the data caused by human error where data were entered manually), (4) how to interpret the results of the analysis (e.g. is the result of the analysis clinically relevant, do the results make sense to clinicians), (5) the business/domain issues, e.g. could the results alter current processes in the healthcare organisation.

Mao et al. [50] present factors influencing effective collaboration between teams of bio-medical scientists and data scientists working at IBM Research. “To find the right answer or to ask the right question?” is the conclusion drawn from interviews with 22 interviewees. It turned out that for bio-medical scientists the original set of questions changes very early in the data mining project into a set of different or “better” questions. This, however, poses a challenge for data scientists, who need to adjust to a new “common ground” that differs from the initial “common ground” achieved at the start of the data science project. Mao et al. illustrate the dynamics of the data science project between domain experts (bio-medical scientists) and data scientists. They also comment on the differing motivations behind the data science project for bio-medical scientists and data scientists. As one of the participants stated, “we are always reproducing predictive models with higher predictive capabilities in the field (...) however we are more interested in what intervention can be done rather than the prediction is accurate” [50]. In addition to a detailed analysis of dynamics within the team, they provide an overview of technologies used to enable co-design, communication and collaboration between bio-medical scientists and data scientists (e.g. Google Docs, Google Sheets, GitHub, Skype, email, Slack, etc.).

7.3 Limitations

Our experiment has limitations, and we will try to address them in future work on larger datasets. To deal with missing values, we used single imputation, replacing the missing values of a variable with its median. Given that there was only a small percentage of missing values (only variables with less than 10% missing values were used in the experiment) and that these variables did not follow the normal distribution, single imputation using the median value would be an appropriate method. Even though Jiang et al. [51] used mean imputation to impute missing data in features prior to performing unsupervised clustering on a heart failure dataset, it may be seen as a limitation that we have not used more sophisticated methods such as kernel density, IDW or K-nearest neighbours to deal with missing values.
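
The missing-value handling described above (keep only variables with less than 10% missing values, then fill the gaps with the variable's median) can be sketched as follows. This is an illustrative reimplementation under our own assumptions, not the code used in the experiment; the function name `impute_median`, the column names and the values are hypothetical.

```python
from statistics import median


def impute_median(columns, max_missing=0.10):
    """Keep columns with less than `max_missing` missing; fill gaps with the median."""
    cleaned = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        missing_share = (len(values) - len(observed)) / len(values)
        if missing_share >= max_missing:
            continue  # too many gaps: the variable is excluded, not imputed
        fill = median(observed)
        cleaned[name] = [fill if v is None else v for v in values]
    return cleaned


# Hypothetical columns, for illustration only (None marks a missing value).
columns = {
    "sodium": [138, 141, None, 135, 139, 140, 137, 142, 136, 138, 139],
    "ferritin": [None, 50, None, 80, 60, 70, 55, 65, 75, 85, 90],
}
cleaned = impute_median(columns)
print(cleaned)
```

Here `sodium` has one missing value out of eleven (about 9%) and is imputed, whereas `ferritin` has two (about 18%) and is dropped.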

In our analysis, we used PCA, which is not free from disadvantages. According to Dormann et al. [52], PCA presumes a multinormal distribution of data and does not cope well with outliers. Due to the nature of the clinical data, we dealt with a dataset that has a multivariate distribution. It has been suggested, however, that in practice PCA is a relatively robust technique if it is used for continuous variables that are not strongly skewed and do not have many outliers [52]. In future work, we will explore factor analysis as a method for dimensionality reduction because, in contrast to PCA, factor analysis is performed on the mutual variance (i.e. shared variance) of observed variables. In PCA, however, all the variance in the given variables is taken into consideration and contributes to the end result [38]. Another disadvantage of PCA is that all components are aggregates of correlated variables, which together co-produce a particular component. In factor analysis, by contrast, a factor carries information about the processes that produce the correlations between the variables contributing to that factor [38]. Another limitation of our study relates to the ML method that we selected for the clustering experiment. K-means clustering is a popular method; however, we should perhaps also experiment with the k-medoids clustering method. k-medoids is a partitioning method that is best suited to domains requiring robustness to outliers or inconsistent distance metrics, or to datasets with no clear definition of mean or median [53]. The k-medoids algorithm returns medoids, which are actual data points in the dataset. This facilitates the use of the algorithm in situations where the mean of the data does not exist within the dataset. This is the main difference between k-medoids and k-means, where the centroids returned by k-means may not be within the dataset.
Hence k-medoids is useful for clustering categorical data where a mean is impossible to define or interpret.
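
The difference between a k-means centroid and a k-medoids medoid can be illustrated for a single cluster. The sketch below shows only the representative-point step, not a full k-means or k-medoids implementation; the points are hypothetical and the Manhattan distance is our choice for illustration.

```python
def centroid(points):
    """Coordinate-wise mean, as used by k-means; need not be an observed point."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def medoid(points):
    """Observed point with the smallest total Manhattan distance to all points."""
    def total_distance(candidate):
        return sum(sum(abs(a - b) for a, b in zip(candidate, p)) for p in points)
    return min(points, key=total_distance)


# Hypothetical single cluster containing one outlier (illustrative values only).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (20, 20)]
c = centroid(cluster)
m = medoid(cluster)
print("centroid:", c, "in data:", c in cluster)
print("medoid:  ", m, "in data:", m in cluster)
```

In this example the outlier pulls the centroid away from the bulk of the data to a location that is not an observed record, whereas the medoid remains one of the actual data points.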

8 Conclusions

During this experiment, we demonstrated that the k-means clustering algorithm identified groups of HF patients with distinct features at the physiological level (as evidenced by median blood test results, ECHO findings and clinical observations). The findings at the physiological level were likely to be an accurate reflection of the ‘labels’ given by medical diagnoses as documented for each patient in the dataset. The data-driven approach that utilised PCA seemed more accurate in identifying smaller clusters with distinct features at the physiological level. During this experiment, we compared domain-led feature selection with the data-driven approach. From one perspective, the data-driven approach had the advantage over the domain-led approach for feature extraction, as it removed a risk of bias that can be introduced by humans (domain experts). The domain-led approach may potentially prohibit knowledge discovery that can be hidden behind features that physicians do not routinely consider important. Domain knowledge played an important role at the interpretation stage of the clustering experiment, providing insight into the context and preventing far-fetched conclusions. Having carried out this experiment, we have realised the importance of ensuring that the data scientist has appreciated the domain knowledge and fully understood the associated concepts. Therefore, future work may include a framework to ensure that the data scientist understands the domain knowledge. For example, the Delphi technique could be used with a group of experts to form a consensus on which concepts and knowledge a data scientist would need to understand fully in order to carry out a domain-led data mining project. Once that consensus is formed, those concepts can be delivered in the form of training, and some form of assessment will be needed to ensure that knowledge exchange has successfully taken place.
This kind of work is much needed because it would provide consistency across domain-led data mining and would also help reduce the possibility of data scientists misunderstanding concepts and knowledge from the application area. We propose a checklist of questions that should prompt data scientists to actively seek the involvement of domain knowledge experts. This checklist can be further improved and become an agile document, updated when new concepts for enhancing domain-driven data mining arise. Moving forward, embedding domain experts in the data analytics process will not only enhance the accuracy of conclusions but will also be core to closing gaps between academic data scientists and clinicians.