1 Introduction

Data mining has recently become a rapidly evolving area of information technology. Hundreds of novel mining algorithms and new applications in medicine have been proposed to help improve the quality of healthcare systems. Data mining ties together many technical areas, including machine learning, human-computer interaction, databases and statistical analysis. Clinical datasets pose a unique challenge for data mining algorithms and frameworks. These challenges are due to missing values, high dimensionality, unbalanced classes, and various systematic and human errors[1]. Data mining aims to automatically extract knowledge from large scale data. However, the information and knowledge mined from such large quantities of data must be meaningful enough to bring some advantage. As a result, effective planning of medical care and treatment of patients with heart failure has proved to be elusive.

With the advent of electronic health (patient) records (EHR/EPR)[2,3], large amounts of clinical data have started to become available. However, good, robust, and accurate models for diagnosing and predicting the survivability of patients are not extensively available. Clinical datasets are often extremely complex due to the fact that there are large numbers of variables, and a great deal of missing data and non-normally distributed data. In addition, given the large number of data mining techniques, it can be difficult to decide which technique is required in order to get the correct results from a given dataset. This often means that if the underlying characteristics of the dataset change, the technique must also be changed.

The goal of data mining in health care systems is to assist clinicians in improving the quality of prognosis and diagnosis, and to generate timelines for the medical problem. The target problem was extracted from the dataset using a variety of data mining processes, which were also used to predict mortality and survival time of patients with heart failure. Machine learning techniques, both supervised and unsupervised, were applied to compare prediction performance on the clinical dataset. This paper examines a large clinical dataset with a view to understanding its underlying properties and the compromises necessary in the selection of methods for data mining. Thus this paper aims not only to explore and select suitable techniques for handling clinical datasets but also to analyse them. The clinical dataset used is a large heart failure dataset (LIFELAB)[4,5]. Over the years, a large number of results have been presented, specifically dealing with the issue of feature selection and the development of models for heart failure using data mining techniques[6–28]. The generic process applied here is: 1) missing values imputation, 2) feature selection, 3) classification and 4) clustering. There are a large number of techniques available for feature selection[29–31]. Three of these are selected: t-test[32], entropy ranking[33,34], and nonlinear gain analysis (NLGA)[35]. All feature selection methods, and indeed dimension reduction techniques in general, use a measure of feature importance to select the most relevant features, thereby reducing the dimensionality of the problem. The rationale for this selection is that the three techniques use different properties of the data to select significant features or variables (here, features and variables are used interchangeably). The t-test method utilizes data distribution as a key property for selecting variables. The entropy method not only uses the distribution, but also includes a measure of data density, and develops a measure for the degree of order in the data. NLGA considers higher weight variables to be more significant based on the artificial neural net input gain measurement approximation (ANNIGMA). ANNIGMA[35] uses neural networks for training on large volumes of data and considers higher weight variables to be the subset of significant features. The results indicate that non-parametric classifiers, such as decision trees, perform better than parametric methods such as radial basis function networks (RBFN), multilayer perceptron (MLP), and k-means, because the latter assume that clinical data is normally distributed.

The paper is structured as follows: Section 2 provides some definitions, which are then used later in the paper. Section 3 describes a clinical dataset which has the typical characteristics of many clinical datasets. This section also outlines the embedded characteristics of the dataset, which will prove useful in the analysis of the results. In Section 4, several techniques for data mining are outlined. The category of techniques depends on the stage of the data mining process. Therefore, methods for imputing missing values are discussed first, before moving on to feature selection and classification algorithms. Section 5 analyses the results in the context of the characteristics of the dataset, evaluating and validating the problems associated with the data by establishing a relationship between the complexities, the set of selected features, and the data distribution. The set of appropriate features is the one with the highest classification performance. Section 6 discusses the results in relation to previously established findings in the literature. Finally, in Section 7 we draw some concluding remarks, summarize the analysed results and specify the further steps of the research as future work.

2 Preliminaries

Let \(X \in {{\bf{R}}^{n \times m}}\) be the clinical dataset, where n is the number of patient records, and m is the number of attributes (variables). Let \({x_{ij}} \in {\bf{R}}\), i = 1, ⋯, n and j = 1, ⋯, m, be the (i, j)-th entry of the dataset under consideration. \({x_{ij}}\) is defined as the value of the j-th variable for the i-th patient.

Issues associated with the dataset include high dimensionality, incomplete or missing values, and diverse clinical features and their magnitudes. However, many of the features present are irrelevant and redundant. The problem is determining a mapping from the high dimensional space to a lower dimensional space, i.e.:

$$v:\chi \longrightarrow \chi; \chi \in {{\bf{R}}^k};k \ll n$$
(1)

For feature selection, the requirement is that \(\chi \subset X\), since the main interest is to retain the labels associated with the variables. On the other hand, this is not required for feature extraction, since it employs latent variables. (See Fig. 1)

Fig. 1 Data distribution of variables in clinical dataset

Definition 1. A subset of selected features (variables/attributes) is chosen by dimensionality reduction techniques; the result is the matrix \({{\bar X}_{n \times \bar b}}\).

$${\bar X_{(n \times \bar b)}} \subset {\bar X_{(n \times b)}}$$
(2)

where \(b \gg \bar b\), b is the number of the original features, \({\bar {b}}\) is the number of the selected features, and \({{\bar X}_{(n \times \bar b)}}\) is the data matrix that presents the significant features.

The process of reducing the dimension is essentially one of determining a projection from the higher dimensional space to a lower dimensional one. Since most projection mappings employ local projections, it is imperative that the matrix \({A_{{\rm{data}}}}\) should not contain missing elements. As such, it is important to define missing data before designing an appropriate imputation method.

$${A_{{\rm{data}}}} = \left[ {\matrix{{{x_{11}}} \hfill & \cdots \hfill & {{x_{i1}}} \hfill \cr \vdots \hfill & \ddots \hfill & \vdots \hfill \cr {{x_{1j}}} \hfill & \cdots \hfill & {{x_{ij}}} \hfill \cr } } \right]$$
(3)

Definition 2. Nullity values are defined as missing values, where values are absent or not recorded for a given attribute. The nullity set collects the entries \({x_{ij}}\) of the data matrix X that are null.

$${\rm{nullity}} = \{ {x_{ij}} \in X:{x_{ij}} \in \emptyset \} $$
(4)

Find the number of missing values for each column (variable), \([{N_1},{N_2},{N_3}, \cdots, {N_m}]\).

$$[{N_1},{N_2},{N_3}, \cdots, {N_m}] = {\rm{count}}_{j = 1}^m({\rm{nullity}}{(X)_{1, \cdots, n,j}})$$
(5)

(the nullity location of the dataset). The dataset

(6)

where \({\bar \chi }\) is the data matrix that shows the locations of the missing values.
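The nullity set of Eq. (4) and the per-column counts of Eq. (5) can be sketched as follows. This is a minimal illustration, assuming the dataset is held as a list of patient rows and that a missing entry is encoded as `None` or NaN; the function names `nullity_mask` and `missing_per_column` and the toy values are hypothetical and not taken from the study.

```python
def nullity_mask(X):
    """Return a 0/1 matrix with 1 wherever an entry of X is missing.

    Assumption: a missing value is encoded as None or as a float NaN.
    """
    def is_missing(v):
        # NaN is the only float that is not equal to itself
        return v is None or (isinstance(v, float) and v != v)
    return [[1 if is_missing(v) else 0 for v in row] for row in X]


def missing_per_column(X):
    """Count the missing entries of each column (variable): [N_1, ..., N_m]."""
    mask = nullity_mask(X)
    return [sum(column) for column in zip(*mask)]


# Toy example: 3 patients (rows) and 3 variables (columns)
X = [
    [13.2, None, 5.1],
    [None, None, 4.8],
    [12.7, 88.0, float("nan")],
]
counts = missing_per_column(X)
print(counts)
```

The mask plays the role of a matrix of missing-value locations, and the counts can be divided by the number of rows to obtain the per-variable missing percentage used when excluding variables.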

The incomplete, erroneous and noisy data are corrected by imputation. The dataset Ψ(n × m) is the clinical data matrix, consisting of n patient records and m variables (attributes). Let \({x_{ij}} \in {\bf{R}}\), i = 1, ⋯, n and j = 1, ⋯, m, be the entries of the dataset under consideration; together, the entries \({x_{ij}}\) form the record of each patient.

3 Mining issues in clinical dataset

This study focuses on a heart failure dataset consisting of continuous data, which contains diverse clinical features and numerous subsets, as well as both longitudinal and horizontal data across several generations. The dataset also importantly presents the incidence, prevalence and persistence of heart failure. High-risk patients with heart failure were targeted for evaluation and treatment in a cost-effective manner[26, 36]. The dataset in this paper is a large cardiological database called LIFELAB: A prospective cohort study consisting of 463 variables which are both continuous and categorical, and 2032 patients who were recruited from a community-based outpatient clinic based in the University of Hull Medical Centre, UK. Variables with missing values greater than 20% were excluded to minimize problems during the data mining process. As a result, the number of variables and patients were substantially reduced to 60 variables and 1051 patients. This indicates that the data consisted of multiple missing values that either needed replacement or elimination to allow appropriate analysis and algorithmic implementation. The challenges and complexities in large clinical datasets are discussed in the following outlined topics.

3.1 Incomplete, erroneous and noisy data

There is a wealth of clinical and health records generated every day and kept in storage. This raw clinical data is usually incomplete, containing missing values due to different systematic ways through which the real world data is collected by healthcare practitioners. Clinical datasets almost inevitably contain missing values and misclassified values. Methods of data imputation[37, 38] and missing value replacement are employed to cope with these issues. Inconsistent data can also exist, e.g., when data collection is done improperly or mistakes are made in data entry; the data may also contain error and noise. Commonly, outliers due to entry errors are also found and these were manually inspected to remove irrelevant variables.

3.2 Diverse clinical features and their scales

There are approximately 400 features in the dataset, comprised of many scales of measurement. Some variables consist of integer and decimal values and some scales have a wide range while some have a small range. Normalisation will be applied to solve these problems so that the data elements are within the same scale and manageable for sequential data mining processes.

3.3 Large dimensionality

Large dimensionality is indicated by too many features. Feature selection efficiently copes with this issue. The technique selects meaningful features which can be used in predictive modelling.

The data exploration reveals that the data distribution affects the mining process, including feature selection, classification and clustering analysis. Fig. 1 shows an example of the distribution of variables in the clinical dataset. In theory, the data should be normally distributed. However, it can be seen that this is not the case. It can be seen from Tables 2 and 3 that imputing missing values showed no significant changes and, as a result, the transformation procedure was unable to improve the precision.

4 Data mining processes in heart failure dataset

The mining process that is implemented in this paper can be represented as a four-stage process. The stages are 1) missing values imputation, 2) dimension reduction using feature selection techniques, 3) classification/clustering, and 4) evaluation. In this section, each of these four stages is discussed and the methods are outlined. The data mining framework for handling complexities is outlined in Fig. 2.

Fig. 2 The framework for handling complexities in clinical dataset

4.1 Missing value imputation

Data pre-processing is undoubtedly the first step in any form of data analysis and mining of data if the right results are to be obtained[36, 37]. At this stage, any redundant data, irrelevant variables and variables with more than 30% missing data are manually removed[38, 39].

Most datasets encountered contain missing values. Depending on their robustness, machine learning schemes have the ability to handle such datasets. The imputation methods used in this paper are mean imputation, expectation-maximization (EM) algorithm, k-nearest neighbour (k-NN) imputation, and artificial neural network (ANN) imputation[40]. After the application of each of the imputation methods, the data was normalized in order to ensure that all the variables were within the same range so that both data integrity and high performance could be obtained during the mining process.

4.1.1 Mean imputation

A popular method is to use the mean of the data for imputation. Here missing data for a given feature (attribute/variable) is replaced using the mean of all known values of that attribute. However, mean imputation makes only a trivial change in the correlation coefficient and there is no change in the regression coefficient[40, 41].
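As a minimal sketch of mean imputation, the function below replaces each missing entry of one variable with the mean of its observed values. It assumes missing values are encoded as `None`; the name `mean_impute` and the sample values are illustrative only.

```python
import statistics


def mean_impute(column):
    """Replace None entries with the mean of the observed values of the column."""
    known = [v for v in column if v is not None]
    if not known:
        raise ValueError("column has no observed values")
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in column]


# Hypothetical values for a single variable with two missing entries
column = [13.0, None, 15.0, 14.0, None]
imputed = mean_impute(column)
print(imputed)

# The mean is preserved, but the spread shrinks because the imputed
# points all sit exactly at the mean.
observed = [v for v in column if v is not None]
print(statistics.pvariance(observed), statistics.pvariance(imputed))
```

The shrinking variance is one reason the distribution of a variable can change after mean imputation, even though its mean does not.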

4.1.2 Expectation-maximization (EM) imputation

Expectation-maximization uses other variables of the dataset to impute a value (expectation) and then checks whether that is the value most likely (maximization) to occur. Here the covariance matrix is estimated, and values to be imputed are generated using this covariance data. This method preserves the relationship with other variables, and is important where factor analysis or regression analysis is applied. As a result, EM imputation is one of the most accurate methods of imputation. However, this is a reasonable approach only if the percentage of missing data is very small[42].

4.1.3 k-nearest neighbour imputation

Often, in large data sets it is possible to find two or more records which are similar, but one of them has a particular attribute missing. It is perfectly feasible to use the value from the closest record in similarity to replace the missing value. k-NN imputes missing data by applying this nearest-neighbour strategy[40]. Missing values of a variable are imputed by considering a number of records that are most similar to the instance of interest. In order to determine the similarity of records, a distance function (e.g., Euclidean distance) can be used as a measure.
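The nearest-neighbour strategy can be sketched as below. This is an illustrative version under several assumptions that the text does not fix: missing values are encoded as `None`, the Euclidean distance is computed only over attributes observed in both records, and the imputed value is the plain average of the k nearest donors. The name `knn_impute` is hypothetical.

```python
import math


def knn_impute(records, k=2):
    """Fill None entries with the mean of the k most similar records.

    Similarity is the Euclidean distance over attributes that are
    observed in both records (an assumption made for this sketch).
    """
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared))

    result = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            # Candidate donors: other records that have attribute j observed
            donors = [(distance(rec, other), other[j])
                      for t, other in enumerate(records)
                      if t != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            nearest = [v for _, v in donors[:k]]
            if nearest:
                result[i][j] = sum(nearest) / len(nearest)
    return result


records = [
    [1.0, 2.0, 10.0],
    [1.1, 2.1, 12.0],
    [9.0, 9.0, 50.0],
    [1.0, 2.0, None],   # third attribute is missing
]
filled = knn_impute(records, k=2)
print(filled[3])
```

The last record is closest to the first two, so its missing attribute is filled from their values rather than from the distant third record.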

4.1.4 Artificial neural imputation

ANN is an interconnected assembly of nodes (or neurons)[43, 44] where information or relationships are stored in the interconnections between them in the form of weights. In order to obtain these weights, the ANN has to learn or be trained using a training dataset. This approach can be seen as an extension of the EM approach, where instead of covariance, a nonlinear mapping is obtained to determine the missing values.

These methods were used to impute missing values in the dataset described in Section 3. Table 1 shows some of the variables with approximately 1% to 20% missing values and the results obtained by imputing the missing values. The results shown in Table 1 compare the statistical properties of the data with no imputation and after imputation. It can be seen that with some methods the values of the standard deviation (σ) and mean (μ) have changed. In Table 2, #data indicates the number of data points within the normal distribution range, i.e., data points within the range of [μ − σ, μ + σ]. It can be seen that the missing value imputation methods (EM, k-NN, Mean and ANN) show an increase in the number of data points under the distribution curve. In addition, the table shows the effect of the imputation methods on the same variable. For example, Tables 1 and 2 show that the imputation method based on k-NN produces better results for Haemoglobin and Iron, whilst the ANN based method shows the most accurate results for Glucose, vitamin B12 and red cell folate, and that mean imputation is suitable for mean corpuscular volume (MCV). Each of these methods has a specific way of imputing the missing value, and the primary nature of the distribution is either retained by the imputation method or is fundamentally changed. Indeed, this can be seen from Table 2, where the distributions before and after imputation are shown.
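The #data quantity described above, the number of points lying within one standard deviation of the mean, can be computed as in the following sketch. Whether the study used the population or the sample standard deviation is not stated; the population form is assumed here, and the function name and sample values are hypothetical.

```python
import statistics


def count_within_one_sigma(values):
    """Count the values lying in the closed interval [mu - sigma, mu + sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (assumption)
    return sum(1 for v in values if mu - sigma <= v <= mu + sigma)


values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, sigma 2
within = count_within_one_sigma(values)
print(within)
```

Running the same count on a variable before and after imputation gives a simple way to compare how each method changes the bulk of the distribution.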

Table 1 The statistic of variables before and after missing value handling by different methods
Table 2 Data distribution of different variables of the original data and missing value replacement data

4.2 Feature selection

Feature selection, also known as subset selection, is a process that selects the most relevant attributes (features). This process not only determines the most relevant features, it also reduces the dimensionality of the problem (Fig. 3), thus reducing the complexity and processing time while at the same time improving performance. In general, a feature selection algorithm is often composed of three components: a performance function, a search algorithm and an evaluation function. The performance function provides the optimal subsets appropriate for classification. The search algorithm performs the search for an appropriate subset of features. The evaluation function takes a feature subset as input and outputs a numeric evaluation.

Fig. 3 The dimensionality reduction from a high dimension to a small dimension

Feature selection has been successfully applied to the following datasets: lymphoma, gene expression, cancer[31, 33, 45]. Poolsawad et al.[39] state that feature selection consistently reduces feature set size and provides better accuracy for classification. Further, Liu et al.[34] also state that feature selection plays an important role in classification, and is effective in enhancing learning efficiency, increasing predictive accuracy, and reducing the complexity of learning results. In addition, learning is efficiently achieved with just relevant and non-redundant features.

There are two general forms of feature selection procedures: 1) a wrapper model and 2) a filter model[46].

The wrapper model uses the predictive accuracy of a predetermined learning algorithm to determine the goodness of the selected subsets. The learning algorithm is run with various subsets of features, and the learner that performs the best is chosen. In contrast, the filter model presents the data with the chosen subset of features to a learning algorithm. It separates feature selection from classifier learning and selects feature subsets that are independent of any learning algorithm[14, 47]. In comparison to the wrapper model, the filter model is computationally efficient. However, the filter model is known to perform much worse than the wrapper model. A key aspect which needs to be considered when selecting a subset of features is the metrics used for determining the relevance or redundancy of a particular feature. An optimal subset of features should contain a set of robust and relevant features along with a set of weak features[46]. This allows for the selection of features with a positive Z-score[47]. It is possible to obtain different selection of subsets of features depending on the criterion used. Thus the subset obtained using a statistical correlation criterion would be different from when mutual information is used.

4.2.1 Nonlinear gain analysis

Nonlinear gain analysis (NLGA), also known as artificial neural net input gain measurement approximation (ANNIGMA), is a feature ranking procedure[34]. In this approach, a neural network is repeatedly trained, and after each training operation a set of variables is eliminated based on their effectiveness and significance in predicting the required class or outcome. In the first step, all the features are used as inputs and the network is trained. Once the network has been trained, an ANNIGMA score is determined as

$$L{G_{ik}} = {\Sigma _j}|{w_{ij}} \times {w_{jk}}|$$
(7)
$${\rm{ANNIGM}}{{\rm{A}}_{ik}} = {{L{G_{ik}}} \over {\max (L{G_{ik}})}} \times 100$$
(8)

where i, j and k index the input, hidden, and output layer nodes, respectively, \(L{G_{ik}}\) is the local gain of input i with respect to output k, and \({w_{ij}}\) and \({w_{jk}}\) are the weights between the layers.

Features associated with low ANNIGMA scores are eliminated and another network is trained. This is carried out till such a point that the network performance starts to degrade. The NLGA is a wrapper model and appropriate for handling large datasets with a high dimension. This approach can reduce the dimensions while also maintaining the required accuracy. However, due to its high computational requirements, its application to extremely large data sets is limited.
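Eqs. (7) and (8) can be evaluated directly from the weights of a trained network, as in this sketch. It assumes a single hidden layer, takes the maximum in Eq. (8) over the inputs for each output node, and uses made-up weights; the training and elimination loop is not shown, and the name `annigma_scores` is hypothetical.

```python
def annigma_scores(w_in_hidden, w_hidden_out):
    """Compute ANNIGMA scores from the weights of a one-hidden-layer network.

    w_in_hidden[i][j]  : weight from input i to hidden node j
    w_hidden_out[j][k] : weight from hidden node j to output k
    Returns scores[i][k] on a 0-100 scale.
    """
    n_in = len(w_in_hidden)
    n_hidden = len(w_hidden_out)
    n_out = len(w_hidden_out[0])

    # Local gain LG_ik = sum_j |w_ij * w_jk|   (Eq. 7)
    lg = [[sum(abs(w_in_hidden[i][j] * w_hidden_out[j][k]) for j in range(n_hidden))
           for k in range(n_out)]
          for i in range(n_in)]

    # Normalise by the largest local gain for each output   (Eq. 8)
    scores = []
    for i in range(n_in):
        row = []
        for k in range(n_out):
            max_lg = max(lg[t][k] for t in range(n_in))
            row.append(100.0 * lg[i][k] / max_lg if max_lg > 0 else 0.0)
        scores.append(row)
    return scores


# Made-up weights: 3 inputs, 2 hidden nodes, 1 output
w_in_hidden = [[0.5, -1.0], [0.1, 0.2], [2.0, 0.0]]
w_hidden_out = [[1.0], [-0.5]]
scores = annigma_scores(w_in_hidden, w_hidden_out)
print([round(row[0], 1) for row in scores])
```

In an elimination loop, the inputs with the lowest scores would be dropped and the network retrained on the remaining features.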

4.2.2 t-test

Student’s t-test uses statistical tools to assess whether the means of two classes are statistically different from each other, by calculating the ratio between the difference of the means and the variability of the two classes. This method has been found to be efficient in a variety of application domains, for example in: 1) genotype research[31, 33, 47], where the problem is one of evaluating differential expressions of genes from two experimental conditions, and 2) the ranking of features for mass spectrometry[48–50] and microarray data[47, 51, 52]. The use of the t-test is limited to two-class problems. For multi-class problems, the procedure requires the computing of a t-statistic value (following the equations in [32, 33, 47]) for each feature corresponding to each class by evaluating the difference between the mean of one class and all the other classes, where the difference is standardized by the within-class standard deviation as

$$t({x_i}) = {{({{\bar y}_1}({x_i}) - {{\bar y}_2}({x_i}))} \over {\sqrt {\left({{{s_1^2({x_i})} \over {{n_1}}} + {{s_2^2({x_i})} \over {{n_2}}}} \right)} }}$$
(9)

where \(t({x_i})\) is the t-statistic value for feature \({x_i}\); \({{\bar y}_1},\,{{\bar y}_2}\) are the means of classes 1 and 2; \(s_1^2,\,s_2^2\) are the within-class variances of classes 1 and 2; and \({n_1}\) and \({n_2}\) are the numbers of samples in classes 1 and 2, respectively.
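A minimal sketch of ranking features by Eq. (9) follows. It assumes the sample variance (n − 1 denominator) for the within-class terms and ranks features by the absolute value of the statistic; the feature names and values are invented for illustration and are not drawn from LIFELAB.

```python
import math
import statistics


def t_statistic(class1, class2):
    """Two-class t-statistic of Eq. (9) for one feature."""
    n1, n2 = len(class1), len(class2)
    mean1, mean2 = statistics.mean(class1), statistics.mean(class2)
    var1, var2 = statistics.variance(class1), statistics.variance(class2)
    return (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)


def rank_features(data_class1, data_class2):
    """Order feature names by decreasing |t|; inputs map name -> list of values."""
    scores = {name: abs(t_statistic(data_class1[name], data_class2[name]))
              for name in data_class1}
    return sorted(scores, key=scores.get, reverse=True)


# Invented two-class data with two hypothetical features
class1 = {"feature_a": [130.0, 132.0, 131.0, 129.0],
          "feature_b": [70.0, 72.0, 68.0, 74.0]}
class2 = {"feature_a": [140.0, 141.0, 139.0, 142.0],
          "feature_b": [69.0, 71.0, 73.0, 70.0]}

t_a = t_statistic(class1["feature_a"], class2["feature_a"])
ranking = rank_features(class1, class2)
print(round(t_a, 3), ranking)
```

Features whose class means are far apart relative to their spread rise to the top of the ranking, which is why the method depends on the distribution of each variable.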

4.2.3 Entropy ranking

While the NLGA approach selects features purely based on their contribution to the final result, and the t-test approach utilizes statistical properties to determine the required features, entropy based approaches not only take into account the statistical properties of the features, but also the compactness and density of the data. Entropy is a measure of the information conveyed by the probability distribution function of a particular variable/feature. Using this entropy, Fayyad[32] suggests a cut-off point selection procedure based on the class entropy of a subset. In general, if we are given a probability, P(·), then the information conveyed by this distribution, also called the entropy of P, is given by

$${\rm{Ent}}(S) = - \sum\limits_{i = 1}^k P ({C_i},S)\log (P({C_i},S))$$
(10)
$${\rm{Ent}}(S) = - \sum\limits_{i = 1}^k {{{{C_i}} \over S}} \log {{{C_i}} \over S}$$
(11)

where Ent(S) measures the amount of information required to specify the classes in a set S, and \(P({C_i},S)\) is the proportion of examples in S that belong to the i-th class \({C_i}\). The entropy values are sorted in ascending order, and the features with the lowest entropy values are selected.
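The entropy of Eq. (11) can be computed from class counts, and features can then be ordered by ascending entropy, as in the sketch below. A base-2 logarithm is assumed, since the text does not specify the base, and the per-feature class counts are hypothetical placeholders for whatever subsets the ranking procedure produces.

```python
import math


def entropy_from_counts(class_counts):
    """Entropy of a set S given the number of examples in each class (Eq. 11)."""
    total = sum(class_counts)
    # Empty classes contribute nothing (0 * log 0 is taken as 0)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)


def rank_by_entropy(feature_entropy):
    """Return feature names sorted by ascending entropy."""
    return sorted(feature_entropy, key=feature_entropy.get)


# Hypothetical class counts associated with three features
feature_entropy = {
    "feature_a": entropy_from_counts([5, 5]),   # evenly mixed classes
    "feature_b": entropy_from_counts([9, 1]),   # nearly pure
    "feature_c": entropy_from_counts([7, 3]),
}
ranking = rank_by_entropy(feature_entropy)
print(ranking)
```

An evenly mixed set gives the maximum entropy of one bit for two classes, while a nearly pure set gives a value close to zero and is therefore ranked first.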

Table 3 shows the features selected using the ANN imputation and NLGA feature selection technique. The result compares the selected features in both outcomes — mortality (dead/alive) and mortality time frame, and it indicates that the variables highlighted appeared in both outcomes. This signifies that both applied techniques are capable of locating significant variables in the dataset.

Table 3 The selected features using ANN imputation and NLGA

4.3 Classifiers

The classifier algorithms employed in this paper are multilayer perceptron (back-propagation), J48 (decision tree) and radial-basis function (RBF) network. These classification techniques were implemented in the Waikato Environment for Knowledge Analysis (WEKA)[53].

4.3.1 Multilayer perceptron (back-propagation)

Multilayer perceptrons (MLP) are feedforward neural networks, and are used for learning classification or unknown nonlinear functions[54]. In a multilayer perceptron (see Fig. 4), there is an input layer in which each node represents an independent variable. There may be one or more intermediate hidden layers, and each node in the output layer corresponds to a different class of the target variable. In this paper, a feed-forward network consisting of input units, hidden neurons and one output neuron is optimized to classify the outcome. The number of input units is the same as the number of input attributes of the selected variables and the number of hidden neurons is half the number of input attributes. All weights are randomly initialized to a number close to zero and then updated by the back-propagation algorithm. The back-propagation algorithm contains two phases: a forward phase and a backward phase. In the forward phase, we compute the output values of each layer unit using the weights on the arcs. In the backward phase, the weights on the arcs are updated by a gradient descent method to minimize the squared error between the network values and the target values.

Fig. 4 A multilayer perceptron structure

The architecture of the multilayer perceptron shows the output y, a vector with n components, determined in terms of the m components of the input vector x and the l components of the hidden layer. The mathematical representation is expressed as

$$\matrix{{{y_i}(x) = \sum\limits_{j = 1}^l {\left[ {{v_{ij}}g\left({\sum\limits_{k = 1}^m {{w_{ij}}} {x_k} + {b_{wj}}} \right) + {b_{vi}}} \right]}, } \hfill \cr {i = 1, \cdots, n} \hfill \cr }$$
(12)

where \({v_{ij}}\) and \({w_{ij}}\) are synaptic weights, \({x_k}\) is the k-th element of the input vector, g(·) is an activation function, and b is the bias, which has the effect of increasing or decreasing the net input of the activation function depending on whether it is positive or negative, respectively.
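The forward phase for a network with one hidden layer can be sketched as follows. Several choices here are assumptions about how to read Eq. (12) rather than details confirmed by the text: the hidden-layer weights are indexed as `w[j][k]` (hidden unit j, input k), each output bias is added once per output unit, and the activation g is the logistic sigmoid. All weights in the example are arbitrary.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice for g)."""
    return 1.0 / (1.0 + math.exp(-z))


def mlp_forward(x, w, b_w, v, b_v, g=sigmoid):
    """Forward pass of a single-hidden-layer perceptron.

    x   : input vector with m components
    w   : hidden weights, w[j][k] connects input k to hidden unit j
    b_w : hidden biases, one per hidden unit
    v   : output weights, v[i][j] connects hidden unit j to output i
    b_v : output biases, one per output unit
    """
    hidden = [g(sum(w[j][k] * x[k] for k in range(len(x))) + b_w[j])
              for j in range(len(w))]
    return [sum(v[i][j] * hidden[j] for j in range(len(hidden))) + b_v[i]
            for i in range(len(v))]


# Arbitrary weights: 2 inputs, 2 hidden units, 1 output
x = [0.0, 0.0]
w = [[1.0, 1.0], [1.0, -1.0]]
b_w = [0.0, 0.0]
v = [[2.0, 4.0]]
b_v = [1.0]
y = mlp_forward(x, w, b_w, v, b_v)
print(y)
```

With a zero input, both hidden units output g(0) = 0.5, so the single output is 2 × 0.5 + 4 × 0.5 + 1. The backward phase would adjust `w`, `b_w`, `v` and `b_v` by gradient descent on the squared error and is not shown.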

In general, MLPs use a supervised training paradigm to determine the weights and learn the classification problem. An MLP learns how to transform input data into a desired response, so MLPs are widely used for pattern classification[55, 56]. Other training paradigms are available for these networks; here back-propagation is used for illustration.

4.3.2 J48 (decision tree)

A decision tree partitions the input feature space of a dataset into regions, where each region is assigned a label, a value or an action that characterizes its data points (Fig. 5). In this paper, the decision tree is generated for classification using the C4.5 algorithm (J48 is the WEKA implementation of C4.5). Given a set of items (the training set), the algorithm identifies the attribute that discriminates the various instances most clearly. This is performed using a standard equation of information gain. Among the possible values of this feature, if there is any value with no ambiguity, that is, for which the data instances falling within its category have the same value for the target variable, then that branch is terminated and the obtained target value is assigned to it.
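The attribute choice described above can be illustrated with the standard information gain for a categorical attribute. This is a sketch only: it uses plain information gain with a base-2 logarithm, whereas the full C4.5 algorithm also involves gain ratio, continuous attributes and pruning, none of which are shown. The attribute values and labels are invented.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Class entropy of a list of labels (base-2 logarithm assumed)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Entropy reduction obtained by splitting the labels on a categorical attribute."""
    groups = defaultdict(list)
    for value, label in zip(attribute_values, labels):
        groups[value].append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = ["dead", "dead", "alive", "alive"]
informative = ["high", "high", "low", "low"]   # separates the classes perfectly
uninformative = ["a", "b", "a", "b"]           # tells us nothing about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
print(gain_informative, gain_uninformative)
```

The first attribute leaves each branch with a single target value, so those branches would terminate immediately, which is the unambiguous case described in the text.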

Fig. 5 Decision tree for predicting the survival months

4.3.3 Radial basis function network

Radial basis function network (RBFN) is an artificial neural network model that uses RBF as an activation function. Fig. 6 presents the architecture of RBFN. It is composed of three layers: an input layer, a hidden layer and an output layer. Each hidden unit implements a radial activation function (a non-linear transfer function) and each output unit implements a weighted sum of hidden unit outputs.

Fig. 6 A radial basis function network architecture

The output of the i-th neuron in the output layer of the RBF network is determined as

$${y_i}(x) = \sum\limits_{j = 1}^M {{w_{ij}}} \varphi (||x - {c_j}||),\quad i = 1, \cdots, m$$
(13)

where ϕ(·) is the basis function, which is evaluated on the distance ‖x − \({c_j}\)‖, \({c_j}\) is the centre vector for hidden neuron j, \({w_{ij}}\) is the weight between node j of the hidden layer and node i of the output layer, M is the number of hidden neurons, and m is the number of nodes in the output layer. The norm is typically taken to be the Euclidean distance and the basis function is taken to be Gaussian:

$$\varphi (||x - {c_j}||) = {{\rm{e}}^{ - {{||x - {c_j}|{|^2}} \over {2\sigma _j^2}}}}$$
(14)

where \({\sigma _j}\) is the width parameter of the j-th hidden unit in the hidden layer.
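The output computation of Eq. (13) with the standard Gaussian basis function (decaying with distance from the centre) can be sketched as below. The centres, widths and weights are arbitrary illustrative values; how they are learned is outside the scope of this sketch, and the function names are hypothetical.

```python
import math


def gaussian_rbf(x, centre, sigma):
    """Gaussian basis function of the Euclidean distance between x and a centre."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, centre))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def rbfn_output(x, centres, sigmas, weights):
    """Weighted sum of hidden-unit activations for each output node.

    weights[i][j] connects hidden unit j to output node i.
    """
    phi = [gaussian_rbf(x, c, s) for c, s in zip(centres, sigmas)]
    return [sum(w_ij * p for w_ij, p in zip(row, phi)) for row in weights]


# Arbitrary network: 2 hidden units, 2 outputs
x = [0.0, 0.0]
centres = [[0.0, 0.0], [3.0, 4.0]]
sigmas = [1.0, 5.0]
weights = [[2.0, 0.0], [0.0, 1.0]]
y = rbfn_output(x, centres, sigmas, weights)
print(y)
```

The first hidden unit is centred on the input and so responds with 1, while the second, at distance 5 with width 5, responds with exp(−0.5).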

4.3.4 Support vector machines and random forests

Support vector machines (SVMs)[57] are supervised learning models. An SVM is essentially a non-probabilistic binary linear classifier: it uses a representation of the key example points, mapped so that the separate categories are divided by a gap that is as wide as possible. New data points are then mapped into the same space and a prediction is made depending on which side of the divide they fall.

The learning in an SVM is the construction of a hyperplane which is used for classification. An ideal or an optimal hyperplane can be defined as a linear decision function which provides the maximal margin between the vectors of the two classes (see Fig. 7). The support vectors define the margin of largest separation between the two classes. SVMs are a popular classification tool as they have excellent generalization properties. However, the training is slow and the algorithms are numerically complex[58]. This paper uses the SVM algorithm called sequential minimal optimization or SMO[58, 59].

Fig. 7 A separable problem in a 2-dimensional space[57]

A random forest, as the name suggests, is a collection of trees: decision trees, in this case. Algorithms for classification using the random forests approach were developed by Breiman[60]. Here a combination of tree predictors is used, such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The output class of the random forest for a given input is the mode of the classes predicted by the individual trees.
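The aggregation step, taking the mode of the individual tree predictions, can be sketched as follows. Only the voting is shown: the trees here are stand-in functions with arbitrary thresholds on generic features, not trained decision trees, and the construction of the forest from random vectors is omitted.

```python
from collections import Counter


def forest_predict(trees, x):
    """Return the class predicted by the majority (mode) of the trees."""
    votes = [tree(x) for tree in trees]
    return Counter(votes).most_common(1)[0][0]


# Stand-in "trees": single-threshold rules with arbitrary cut-offs (hypothetical)
trees = [
    lambda x: "dead" if x[0] > 0.5 else "alive",
    lambda x: "dead" if x[1] < 0.2 else "alive",
    lambda x: "dead" if x[2] > 0.7 else "alive",
]

sample = [0.9, 0.4, 0.8]   # votes: dead, alive, dead
prediction = forest_predict(trees, sample)
print(prediction)
```

Using an odd number of trees avoids ties in a two-class problem.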

4.4 Clustering

Clustering is a popular multivariate statistical technique embodied in many processes such as data mining, image processing, pattern recognition and classification[61]. The unsupervised method partitions inherent patterns into clusters, based on the order of similarity, thus discovering the structure of a given data. Data points in the same cluster are classified as similar between one another while those in different clusters are dissimilar. In this paper, we have applied two clustering algorithms known as k-means and hierarchical clustering.

Two major issues should be considered in practice: 1) deciding on the number of clusters to use for each clustering algorithm, and 2) defining the categorical attributes[61, 62]. In this study, the number of clusters will be fixed for both algorithms to ensure a fair and consistent analysis, and different categorical attributes are present in the dataset, each representing a different clinical test. It is important to bear in mind that defining categorical attributes can be a difficult task in cluster analysis[63]. For this reason, the following clustering algorithms are implemented to achieve the best possible clustering outcome based on their respective function.

4.4.1 k-means clustering

k-means clustering is a partition algorithm that organizes n objects into k partitions (k ≤ n), where each partition corresponds to a cluster. The method assumes that k is fixed[64, 65]; the “means” in k-means refers to the cluster means, usually called centroids, denoted as “+” in Fig. 8. The centroid based technique ensures that objects within the same cluster are similar, and that dissimilar objects are assigned to different clusters. Each object is assigned according to its distance from the cluster means, after which a new mean is calculated for each cluster. The process is repeated until the criterion function, known as the “square-error criterion”, converges; it is defined as[66]

$$E = \sum\limits_{i = 1}^{k} {\sum\limits_{p \in {C_i}} {|p - {m_i}{|^2}} }$$
(15)

where E is the sum of the squared error over all n objects in the dataset, p is the point in space representing a given object, and m_i is the mean of cluster C_i (both p and m_i are multidimensional). In other words, the distance from each object to its cluster centre (centroid), marked as “+”, is squared, and these squared distances are summed. The criterion is an essential part of the k-means process because it makes the resulting k clusters as compact and as well separated as possible.
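As a minimal sketch of how the square-error criterion in (15) can be evaluated for a given partition, the following Python function computes E from a list of clusters. The function name `square_error` and the toy points are illustrative assumptions and are not taken from the LIFELAB dataset.

```python
def square_error(clusters):
    """Return E: the sum of squared distances of every point to its cluster mean.

    `clusters` is a list of clusters; each cluster is a non-empty list of
    points, and each point is a tuple of floats of equal length.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        # m_i: component-wise mean of the points in cluster C_i
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        # add |p - m_i|^2 for every point p in C_i
        total += sum((p[d] - mean[d]) ** 2 for p in points for d in range(dim))
    return total


# Hypothetical toy partition with two clusters of two points each
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],  # mean (0, 1): each point contributes 1
    [(4.0, 4.0), (6.0, 4.0)],  # mean (5, 4): each point contributes 1
]
print(square_error(clusters))  # 4.0
```

A smaller E for the same k indicates more compact clusters, which is why the criterion can serve as a stopping measure.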

Fig. 8 Four clusters of the dataset are illustrated

Fig. 9 illustrates the case of k = 2, with two clusters (A and B). Each object, indicated by a bold black dot, is assigned to the cluster with the nearest centre, as shown by the dashed circles in A. The cluster means are then recalculated from the objects in each cluster, and the objects are redistributed to the nearest cluster centre, forming the faded oval shapes shown in B.

Fig. 9 A schematic clustering of a set of objects based on the k-means method. The mean or centroid of each cluster is represented by “+”

The cluster structure is characterized by subsets S_k ⊆ I, where I is the set of entities, and M-dimensional centroids c_k = (c_kv), k = 1, …, K, where K is the number of clusters. The subsets S_k form a partition S = {S_1, ⋯, S_K} with a set of centroids c = {c_1, ⋯, c_K}[44, 67]. The M-dimensional centroid vectors c_k are the cluster centroids, and the cluster lists S_k are updated according to the “minimum distance rule”. The rule assigns entities to their nearest centroids: the distances from each entity i ∈ I to all centroids are computed, and the entity is then assigned to the nearest centroid.
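The alternation between the minimum distance rule and the centroid update can be sketched as follows. This is a simplified illustration using squared Euclidean distance and user-supplied initial centroids, not the implementation used in the experiments. The function names and the toy points are assumptions.

```python
def assign(points, centroids):
    """Minimum distance rule: index of the nearest centroid for each point."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))  # ties go to the lower index
    return labels


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:  # keep the centroid of an empty cluster unchanged
            new.append(old)
            continue
        new.append(tuple(sum(x) / len(members) for x in zip(*members)))
    return new


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the partition stops changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


# Hypothetical toy data: two well-separated groups
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
labels, centroids = k_means(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [(0.0, 0.5), (5.5, 5.0)]
```

Because the outcome depends on the initial centroids passed in, different starting values can lead to different final partitions.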

Sridhar and Sowndarya[68] have shown that k-means produces reliable clustering results, as it is computationally simple and memory efficient. Napoleon and Lakshmi[69] describe two variants of k-means, namely enhanced and bisecting k-means. Neither is discussed further in this study, although Steinbach et al.[63] found bisecting k-means to perform better than standard k-means. Fig. 10 shows three clusters of the two distinct classes, dead and alive. Alive patients are represented by triangles and dead patients by black circles. The alive 1 (right) cluster contains patients predicted as alive, with a few projected towards the dead groups. Fig. 8 illustrates four clusters grouped into the two classes of dead and alive, with the dead 1 (left) cluster representing dead patients.

Fig. 10 The narrow passage scenario

4.4.2 Hierarchical clustering

Hierarchical clustering is employed in this study to reveal similarities between the data attributes. At each stage of the process the method partitions the data into clusters, and these clusters are then combined at the next level, building up a hierarchy of clusters that resembles a tree diagram. The hierarchy is presented as a dendrogram.

Hierarchical clustering is generally classified as either agglomerative or divisive. The agglomerative method, also known as the “bottom up” approach, begins with each observation in its own cluster and then sequentially merges clusters into larger ones[44, 70]. Clusters are merged according to the minimum Euclidean distance between two objects from different clusters (also known as nearest neighbour clustering), so the similarity of two clusters is measured by their closest pair of data points. In contrast, the divisive method, or “top down” approach, is the reverse of agglomerative hierarchical clustering: it begins with all the observations in one cluster and then repeatedly divides it into smaller clusters until each observation is in its own cluster (Fig. 11). The clusters are divided based on the maximum Euclidean distance principle that considers the closest neighbouring objects in the cluster.
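The agglomerative, nearest neighbour merging described above can be sketched in a few lines of Python. This is a naive illustration that recomputes all pairwise distances at every step. The function name `single_linkage` and the five toy points (standing in for objects A to E) are assumptions.

```python
import math


def single_linkage(points, n_clusters=1):
    """Merge the two closest clusters until `n_clusters` remain.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]  # start: one cluster per point
    history = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # nearest neighbour rule: closest pair of points across clusters
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, history


# Hypothetical objects A..E as 2-D points (indices 0..4)
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5), (20.0, 0.0)]
clusters, history = single_linkage(points, n_clusters=2)
for left, right, dist in history:
    print(left, right, dist)
```

The merge distances recorded in `history` correspond to the heights at which branches would join in a dendrogram.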

Fig. 11 Agglomerative and divisive hierarchical clustering on data objects (A, B, C, D, E)

Fig. 12 demonstrates the relationships and similarities between the variables. The vertical axis shows the similarity scale between clusters. As indicated by the dendrogram, urea and creatinine are the most similar, followed by MR-proANP and CT-proET1. This signifies a clear relationship between these variables, which is further supported by the correlation values shown in Table 4. Urea and creatinine are linked to CT-proAVP and ferritin, while uric acid and red cell folate are also merged to form one cluster at a similarity level of approximately 50.

Fig. 12 Dendrogram used in hierarchical clustering to illustrate similarities

Table 4 Indicates correlation comparison

4.5 Performance evaluation measures

Performance measures are used to evaluate classification performance and to compare classifiers. In this study, precision and recall are used, which are defined as

$${\rm{Precision}} = {{TP} \over {TP + FP}}$$
(16)
$${\rm{Recall}} = {{TP} \over {TP + FN}}$$
(17)

where TP is the number of true positives, FP is the number of false positives, TN is the number of true negatives, and FN is the number of false negatives. Precision is a function of the correctly classified examples (true positives) and the misclassified examples (false positives). Recall is a function of the true positives and false negatives. Fig. 13 shows the relationship between the precision and recall values in the dead and alive categories.
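As a small worked sketch of (16) and (17), the following Python function counts TP, FP and FN for a chosen positive class and returns precision and recall. The labels are invented for illustration and are not results from this study.

```python
def precision_recall(actual, predicted, positive):
    """Return (precision, recall) for the class named `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if p == positive and a == positive)
    fp = sum(1 for a, p in pairs if p == positive and a != positive)
    fn = sum(1 for a, p in pairs if p != positive and a == positive)
    # Guard against empty denominators (no predicted or no actual positives)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical outcomes for eight patients
actual = ["dead", "dead", "dead", "alive", "alive", "alive", "alive", "alive"]
predicted = ["dead", "dead", "alive", "dead", "alive", "alive", "alive", "alive"]

print(precision_recall(actual, predicted, "dead"))   # TP=2, FP=1, FN=1
print(precision_recall(actual, predicted, "alive"))  # TP=4, FP=1, FN=1
```

Computing the measures per class, as here, makes the effect of class imbalance visible, because the minority class can score lower even when overall accuracy is high.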

Fig. 13 A relationship between precision and recall values of classification

5 Experimental results

The experiments aim to compare the performance of supervised and unsupervised methods for mining large clinical datasets, using different feature selection and missing value imputation methods. The dataset used in the experiments is normalised to the range between 0 and 1. As in most numerical procedures, this normalisation is carried out to prevent attributes with large numeric ranges from dominating those with small numeric ranges.
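One common way to scale an attribute to the range 0 to 1 is min-max normalisation. The sketch below assumes this is the scheme intended, since the text does not name the exact method. The function name and the sample values are hypothetical.

```python
def min_max_normalise(values):
    """Scale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant attribute: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical attribute values on an arbitrary scale
print(min_max_normalise([2.0, 4.0, 10.0]))  # [0.0, 0.25, 1.0]
```

Applied column by column, this puts every attribute on the same scale before distances or network weights are computed.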

The procedure used in the experiments follows the framework proposed in Table 5. In all experiments, the data is classified according to two outcomes: mortality (dead or alive) and survival time (6, 12, 18, 24, 36, or more than 36 months) (see Table 6). The dataset used in these experiments required the data mining process to analyse the data characteristics. Classification performance (precision and recall) is used to evaluate the different methods for imputing missing values and for selecting features.

Table 5 The classification results from different missing value replacement methods and feature selection (FS) techniques by dead and alive classes
Table 6 The classification results from different type of missing value imputation methods and feature selection techniques on mortality time frame outcome

It can be seen that the following combination produced better results using the features shown in Table 4: 1) classification by the decision tree (Fig. 14), 2) imputation using a neural network, and 3) feature selection using NLGA.

Fig. 14 The classification results from different missing value imputation methods and different feature selection (FS) techniques on the 6 months class

It can be seen that all the imputation techniques, even though they impute different values (Tables 1 and 2), led to similar classification results (Tables 5 and 6). However, the robust methods, for example the EM algorithm, showed better results than the others. The reason is that the EM algorithm determines maximum likelihood estimates. Tables 1 and 2 show the statistics (mean and standard deviation) of the variables and the data distribution before and after applying the imputation techniques. The means and standard deviations for the EM algorithm (Table 1) are similar to those of the original data. This similarity indicates that the method provides greater flexibility in the shape of the distribution while maintaining approximately the same means and standard deviations (Table 2).

Tables 5 and 6 show the differences in performance between the wrapper and filter approaches to feature selection. The NLGA approach provided features which classified the data better than those selected by the t-test and entropy (Tables 5 and 6). NLGA uses a neural network to search for features which satisfy an error criterion. However, in general, wrapper approaches are more computationally intensive than filter approaches (t-test and entropy). It can be seen from Fig. 14 that, for the critical 6 months class, decision trees provide a higher precision than the other classifiers.

Amongst the various approaches for classification, RBFNs and decision trees (DT) performed slightly better than the other classifiers (Tables 5 and 6 and Fig. 14). The basis functions of an RBFN can be advantageous when the data has a multimodal distribution. An RBFN is typically trained within a maximum likelihood framework by maximizing the probability (minimizing the error), and hence the model provides a better approximation and interpolation of noisy data.

A decision tree is a form of non-parametric multivariable analysis. The method requires no information on the distribution of the data. Decision trees are produced by algorithms that identify various ways of splitting a dataset into branch-like segments, and they can generate rules that are easy to understand. Clinical support systems are therefore often developed on the basis of decision trees[71]. Internally, decision trees use information gain and entropy to select the appropriate attribute at each node in order to create the branches.
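The attribute selection step mentioned above can be illustrated with a short sketch of entropy and information gain for a categorical attribute. The attribute names and values are invented for illustration. Practical decision tree algorithms add further refinements, such as handling numeric attributes and pruning, that are not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in entropy obtained by splitting `labels` on `values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Hypothetical example: four patients, two candidate attributes
labels = ["dead", "dead", "alive", "alive"]
urea_level = ["high", "high", "low", "low"]    # separates the classes perfectly
other_test = ["high", "low", "high", "low"]    # carries no class information

print(information_gain(urea_level, labels))  # 1.0
print(information_gain(other_test, labels))  # 0.0
```

At each node the attribute with the highest gain would be chosen for the split, which in this toy case is the first attribute.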

6 Discussion

It is important to note that missing values in datasets are a major issue, as they affect dimensionality reduction and classification[72]. This paper demonstrates four missing value imputation methods: 1) mean imputation, 2) EM algorithm imputation, 3) k-NN imputation and 4) ANN imputation. The primary reason for carrying out imputation is to retain the size of the data rather than reduce it by eliminating records from the dataset. Table 1 shows the statistical properties (mean and standard deviation), and Table 2 shows the data distribution, before and after imputation. Mean imputation uses the population mean of a variable to replace its missing values, while k-NN uses the mean of the k nearest neighbours. Therefore, both methods produced similar results. The EM algorithm estimates values using the maximum likelihood technique. The EM algorithm results shown in Tables 1 and 2 follow a different distribution from the original data, although the method maintains the means and standard deviations. ANN imputation shows an increase in the number of data points under the distribution curve. In addition, imputation techniques have been shown to maintain the size of the dataset and to be applicable to many data types, including categorical and numerical data. It is important to note that imputing missing data with an inappropriate algorithm or technique can lead to biased, invalid or insignificant results. Hence it is vital to select a method appropriate for the particular dataset. A rule of thumb is to visualize the initial distribution of the data: if the data is skewed or contains a high percentage of missing values, a single imputation method may not be appropriate.
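To make the difference between mean and k-NN imputation concrete, the following sketch fills a missing value both ways. It is a simplified illustration, not the procedure used in the experiments: missing values are encoded as `None`, the other columns are assumed complete, and squared Euclidean distance on those columns defines the neighbours. All names and values are hypothetical.

```python
def mean_impute(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]


def knn_impute(rows, target, k):
    """Fill None in column `target` with the mean over the k nearest complete rows."""
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        if r[target] is not None:
            filled.append(list(r))
            continue

        def dist(d, r=r):
            # squared distance on every column except the one being imputed
            return sum((a - b) ** 2
                       for j, (a, b) in enumerate(zip(r, d)) if j != target)

        nearest = sorted(donors, key=dist)[:k]
        value = sum(d[target] for d in nearest) / len(nearest)
        filled.append([value if j == target else x for j, x in enumerate(r)])
    return filled


# Hypothetical records: column 0 is complete, column 1 has one missing value
rows = [[1.0, 10.0], [2.0, 12.0], [9.0, 30.0], [1.5, None]]

print(mean_impute([r[1] for r in rows])[3])  # mean of 10, 12 and 30
print(knn_impute(rows, target=1, k=2)[3][1])  # mean of the two nearest rows: 11.0
```

When k approaches the number of complete records, the k-NN estimate approaches the column mean, so the two methods coincide in that limit.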

Tables 5 and 6 show the results for various combinations of imputation, feature selection and classification methods. It is important to note that the EM algorithm uses the Kullback-Leibler (KL) distance[48], which is also known as relative entropy. Relative entropy defines a distance between two probability distributions, which is used in imputing the missing values. This process is similar to entropy ranking for feature selection. The results in Table 5 indicate that, with only two classes, the precision and recall values are similar. However, unbalanced classes, i.e., classes whose distributions are not even, pose a challenge in terms of classification accuracy. This is a major issue with most clinical datasets, where the observations are based on people with a particular ailment, and a good clinical system is always one where the number of alive patients far outweighs the number of patients who succumb to the ailment. Table 6 shows the results when the class of alive patients is further split into 6 classes of mortality months. Comparing the results from the two tables, it can be seen that a non-parametric classifier such as the decision tree shows the best precision and recall compared with parametric classifiers such as RBFN, MLP and k-means. The key point to note here is that parametric methods are more suitable for data which is normally distributed. Further, considering one class (6 months) in Fig. 14, the decision tree classifier shows better performance across the different feature selection and imputation methods.
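For reference, the relative entropy mentioned above can be computed for two discrete distributions as the sum of p_i log(p_i / q_i). The sketch below only illustrates the measure itself and does not reproduce how the EM imputation uses it. The distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats for two discrete distributions.

    Assumes p and q have the same length, each sums to 1, and q[i] > 0
    wherever p[i] > 0. Terms with p[i] == 0 contribute nothing.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical class distributions
balanced = [0.5, 0.5]
skewed = [0.9, 0.1]

print(kl_divergence(balanced, balanced))  # 0.0 for identical distributions
print(kl_divergence(balanced, skewed))
print(kl_divergence(skewed, balanced))
```

The last two values differ, which shows that relative entropy is not symmetric and is therefore a distance only in a loose sense.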

On further analysis of the results, it can be seen that the variables selected using the t-test method, such as triglycerides, potassium, urea/uric acid, creatinine, NT-proBNP and sodium, have strong associations with mortality in heart failure[73, 74]. Thus a conclusion can be drawn that this method provides the most suitable set of features. However, the results also indicate that all feature selection algorithms perform equally well, improving classification accuracy by similar magnitudes. The clinical importance of the selected variables would therefore determine which method is used. Yu and Liu[46] argue that, in theory, more features should provide more power. In practice, however, an appropriate subset of features performs as well as a larger set[45].

Feature selection depends on the nature of the distribution of the data. The pre-processing step provides information on the data and a better understanding of the nature of its distribution, which allows an appropriate feature selection technique to be chosen. The clustering algorithms employed in this study have shown that the dataset is structured in an unsupervised manner in order to simplify the process of information retrieval. This finding agrees with the work of Bean and Kambhampati[62], who exploited this notion by presenting knowledge extracted from real data in the form of a decision rule set with minimal ambiguity to support decision making. This was accomplished by employing clustering analysis and rough set theory. The authors also explored the conceptual differences and similarities, as well as the link, between the two techniques[67].

It is well known that the k-means algorithm[62] for clustering and classification has some issues, particularly that the results depend on the initial conditions. However, there are methods for selecting suitable initial conditions. In this paper, the method developed by Mirkin[67] has been employed. In this method, the number of clusters, k, and the initial centroids, c_1, c_2, ⋯, c_k, are specified at the outset. Without this initialization, clustering can often produce misleading results as a result of inappropriate final centres and clusters. Mashor[75] suggests that k-means plays an important role in enhancing the performance of RBF networks, since the algorithm determines the centres of the RBF. The location of the centres influences the performance of RBF networks because the activation function depends on the distance between the data and the centres, so obtaining accurate centres is important.

Hierarchical clustering suffers from the disadvantage that the quality of the dendrogram can be poor: once a merge (agglomerative) or split (divisive) decision has been made, it cannot be adjusted or corrected. Agglomerative clustering is also known to be remarkably slow for large datasets owing to its O(n^3) complexity, where n is the number of objects[76].

7 Conclusions and future work

The methods illustrated in this paper have been applied to a heart failure dataset, and can be applied to other clinical datasets, as these present similar issues. This paper has addressed some of the many challenges presented by clinical datasets, and has shown how these can be handled using current methods from statistics and data mining. The first challenge is that of missing values (Tables 1 and 2). There are several methods for handling this challenge. Often a preliminary exercise is to discard the variables with a large percentage of missing values[37, 77], followed by imputing the remaining missing values (Tables 5 and 6). An alternative is to ignore missingness by analysing the incomplete data. Imputation techniques are essential if the original size of the dataset is to be retained and if useful information is to be extracted. In this paper, techniques for imputing missing values were outlined, and these methods produce appropriate values for the missing data.

Table 1 shows the means and standard deviations obtained from the different imputation methods. These mean values are close to the expected mean value and are consistent with the law of large numbers[78]. When the sample size is small, imputation can have a more dramatic effect than when the sample size is large.

In the framework (Fig. 1) provided in this paper, indeed in any data mining framework, reduction of dimensions is almost a necessity after the initial pre-processing of the data. This paper outlined methods for the reduction of dimensions. There is a wide variety of methods, which are broadly classified as feature extraction or feature selection. In most clinical applications, feature selection is more appropriate, as it retains the variable labels and hence the final model is more meaningful. Features are selected based on a criterion, often one that reflects how effective the features are in performing the task of classification and prediction. In this paper, classification accuracy was selected as the criterion to assess the effectiveness of the feature selection methods. The classifiers used were: multilayer perceptron (back-propagation), J48 (decision tree), RBFN (neural network), SVM and random forest. From the results (Tables 5 and 6) it can be seen that both missing value imputation and feature selection affect the result. However, the fundamental factor is to understand the nature of the dataset in order to choose a suitable technique. Another issue that should be noted is the difference between supervised and unsupervised methods in the mining of clinical datasets. These datasets have embedded within them numerous complexities and uncertainties in the form of class imbalances and missing values (which could be systematic). Supervised techniques show better results in terms of the confusion matrix measures (precision and recall) than unsupervised techniques such as clustering (see Tables 5 and 6).

This paper has presented a framework for the mining of clinical datasets. Current research is focused on ways to handle class imbalances within clinical datasets. Often in a clinical setting, the success of the clinic is judged on the number of patients who have recovered from illness and not the number that have succumbed to it. Thus real clinical datasets have a large imbalance, in that the number of patients in the alive class far outweighs the number in the dead class. This imbalance affects imputation, feature selection and classification. Some preliminary results have been obtained and can be seen in [39, 40, 79].