New hybrid method for heart disease diagnosis utilizing optimization algorithm in feature selection

Nourmohammadi-Khiarak, Jalil; Feizi-Derakhshi, Mohammad-Reza; Behrouzi, Khadijeh; Mazaheri, Samaneh; Zamani-Harghalani, Yashar; Tayebi, Rohollah Moosavi

doi:10.1007/s12553-019-00396-3

New hybrid method for heart disease diagnosis utilizing optimization algorithm in feature selection

Original Paper
Open access
Published: 03 December 2019

Volume 10, pages 667–678, (2020)
Cite this article

Download PDF

You have full access to this open access article

Health and Technology Aims and scope Submit manuscript

New hybrid method for heart disease diagnosis utilizing optimization algorithm in feature selection

Download PDF

Jalil Nourmohammadi-Khiarak ORCID: orcid.org/0000-0002-1928-9081¹,
Mohammad-Reza Feizi-Derakhshi²,
Khadijeh Behrouzi³,
Samaneh Mazaheri⁴,
Yashar Zamani-Harghalani² &
…
Rohollah Moosavi Tayebi⁴

4624 Accesses
55 Citations
Explore all metrics

Abstract

The number and size of medical databases are rapidly increasing, and the advanced models of data mining techniques could help physicians to make efficient and applicable decisions. The challenges of heart disease data include the feature selection, the number of the samples; imbalance of the samples, lack of magnitude for some features, etc. This study mainly focuses on the feature selection improvement and decreasing the numbers of the features. In this study, imperialist competitive algorithm with meta-heuristic approach is suggested in order to select prominent features of the heart disease. This algorithm can provide a more optimal response for feature selection toward genetic in compare with other optimization algorithms. Also, the K-nearest neighbor algorithm is used for the classification. Evaluation result shows that by using the proposed algorithm, the accuracy of feature selection technique has been improved.

Improving Heart Disease Prediction Using Feature Selection Through Genetic Algorithm

A comparative analysis of meta-heuristic optimization algorithms for feature selection on ML-based classification of heart-related diseases

Article 03 March 2023

An intelligence method for heart disease prediction using integrated filter-evolutionary search based feature selection and optimized ensemble classifier

Article 05 October 2023

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Heart diseases or cardiovascular diseases (CVD) are one type of diseases that include heart or vessels (veins and arteries). Ten percent of total death in the early twentieth century resulted from heart diseases [1], and the death rate due to these diseases were increased by 25% in the late twentieth century. Heart diseases mainly affect individuals with 65 years old and older, and it has taken the place of infectious diseases as the main reason of death in the world [2]. It was believed that heart diseases are problematic in developed countries, but now they are also expanding in developing countries, since there is no proper health care in these countries.

Recently, World Health Organization has reported that the most common reason of death in the world is CVD. People die from CVD more than any other factor. It has been estimated that 17.1 million individuals died from CVD in 2004, which is 29% of total death in the world. The reason of 7.2 million of these deaths was coronary artery disease that is one of the most common CVD. 5.7 million Of deaths occurred due to cardiac trauma. Countries with low and moderate incomes are affected adversely, 82% of death caused by CVD occurs in these countries, and this rate is similar in both male and female. It is anticipated that until 2030 about 23,600,000 individuals would mostly die from heart diseases and brain stroke. The most increase of death will occur in the eastern Mediterranean region, and most deaths will occur in the Southeast Asia because of changes of life styles, food habits, and occupational culture. Therefore, according to the reports of the WHO, utilizing accurate methods and efficient periodical examination of heart to diagnose heart diseases are very crucial [3].

Heart diseases are the most common reason of death in Iran as well; since, the Iranian ministry of health reported that 46% of death is ceased by heart diseases [4]. The significant growth of these diseases and their complications, and their high costs adversely affect the societies and impose a lot of financial and physical burden on the international community. Therefore, using effective methods to do prevention is very vital.

The construct of a heart disease that includes complete medical evaluation, history taking, and examination with early diagnosis of heart diseases can decrease the mortality rate. One of the best ways to diagnose heart disease is echocardiogram. Echocardiogram or cardiac echo is a painless test using acoustic waves to create images from the heart [5].

However, interpretation of cardiac echo images is a challenge as there is no accurate rule available to infer the data properly. Data analysis of cardiac echo by experts is very time-consuming, and also there are not many experts to do so. Therefore, the approach of automatic cardiac echo interpreter to minimize the human efforts is very important in diagnosis of heart disease. There is a method to extract the hidden data from a huge set of data collected in the past to solve the problems related to diagnosis of diseases. Data mining can make rules from this huge set of data in order to use them in the concept of cardiac echo [6]. Data mining can play an important role to extract data from huge datasets, and it has gained a crucial position in healthcare, recently. The process of extraction is related to classification, clustering, and relationship law discovery, instead of data analysis.

Two datasets are used in the paper, namely; the UCI Machine Learning Repository and Tehran Shahid Rajaei hospital which have 303 samples and 13 features [7] and 303 random patients with 59 features respectively.

When there are features in data, some of them are not useful even though they cause to get bad result therefore, the main aim of this study is to use a combined method to improve the classification and better feature selection which will lead to better diagnosis of heart disease. In this study, imperialist competitive algorithm with meta-heuristic approach is used to optimize the selection of important features in heart disease. This algorithm can provide a more optimal response for feature selection towards genetic and other optimization algorithms. After feature extraction, the features have been supplied to K-nearest neighbor (KNN) for classification proposes. Therefore, using the combination of these two methods can lead to improve the result of heart disease diagnosis and their different aspects. In other words, we are trying to improve classification accuracy on heart disease diagnosis. Using ICA and KNN was our idea which has never been done before. Moreover, according to the result section 4 the proposed method has achieved better result in comparison with other algorithms, which has had two advantages, first decreasing number of features, second, increasing classification accuracy. Objective of this research are as following:

Data collection from new features about the heart disease.
Prediction and classification of incidence of heart disease using the proposed method.
Using new feature selection algorithms for the first time.
Providing a new combined approach with higher accuracy.

This paper consists of four sections. The first section is introduction of topic, and introducing the aims of the study, steps needed to achieve these aims, and the results of the previous related studies in the field of heart disease. In the second part the studies about diagnosis of CVD using data mining are assessed. The third part explains the suggested methods in detail. In this section, first the suggested method and its required data are explained, and then it is assessed for heart disease using standard criteria. The last section explains the conclusion and suggests some works about the things that can be done to continue the optimization of accuracy, and feature selection.

2 Related works

There are a lot of medical data about different kinds of diseases nowadays. Medical centers collect these data for different purposes. One of the aims of using these data is study to obtain useful results and approaches about the diseases. The huge size of these data leads to confusion and impedes achieving appropriate results. A lot of studies have been performed about the data mining techniques for cardiovascular patients, which applied different methods such as decision tree and neural networks. Based on the wide spending heart disease and its high costs, some studies have searched for solutions for prevention, efficient early treatment with lower costs, and as a result decreasing the number of tests.

In [8] a method has been presented to diagnose the heart disease using Particle Swarm Optimization as well as Neural Network Feed Forward Back-Propagation. The main focus was decreasing the number of features and costs. Also, in [9], a Naïve Bayes classification method was used based on the patients’ history. In this study, first contemporary community rules were extracted to preprocess the data in order to obtain high quality patterns. Then, a pattern recognition algorithm was presented to identify contemporary community rules by the identification of most of temporal relationships in temporal abstractions (TAs). Finally, periodical contemporary community rules were combined as the feature for classification.

In [10] decision tree used for data mining in heart disease. One of the objectives of this research is to extract the hidden knowledge from huge datasets of heart disease in order to create a predictor model for heart disease using decision tree. Dataset that is used in this study consisted of 2346 unique samples, which included 1159 healthy individuals and 1187 patients with heart disease.

Feature selection [27] and removing the effects of redundant features in heart diseases can be a good solution to diagnose the disease. In this regard, [11] used medical tests as input in order to present a method, then a set of features are extracted by decreasing dimension, and a diagnosis system for heart disease is provided. This study has used Probabilistic Principal Component Analysis to extract prominent features with high effects. Then, Support Vector Machines (SVM) classifier that is based on the radial basis function (RBF) is used to classify data.

In [12], researchers tried to apply data mining methods to diagnose heart diseases. Different classification methods such as neural networks and decision tree are used to predict heart disease and to identify of its most important factors. Whereas, the discovery of relationship law to identify the effects of diet, life style, and environment on heart disease is used in this study. Clustering algorithms such as k-means algorithm have been used in the datasets of heart diseases which includes clinical data screening of the patients in order to diagnose sample, especially heart attack.

In [13], a combination of decision tree and Bayes network is used. The methodology that has been presented in this study in order to predict coronary disease included understanding and selection of the goal dataset, preparation and normalization, data mining and evaluation, and conclusion and the use of knowledge. Each step has some subsets. First, the samples are selected from patients’ files in order to randomly predict, and the fields or the features of samples are identified and extracted based on the opinions of experts. In second stage of methodology, some new fields that are not directly mentioned in the patients’ files are cleaned, normalized, and calculated. In the 3rd stage, the aim and data mining responsibility that is the use of prediction techniques in the mentioned research is determined. Then, based on the aim the mentioned algorithms are selected in order to predict. The algorithms used in this study included Bayesian network and decision tree as they are more perceptible for the experts. In the 4th stage, results are evaluated and accuracy and precision of the prediction method are assessed. The selected method for prediction is Bayesian network that describes the conditional relationship between variables. This network can create a probable model of the variables in order to determine the probability of occurrence of the feature sets.

In [1], Jabbar used the combination of Genetic algorithm and K-nearest neighbor. The assessment of the previous studies shows that the present methods have decreased the difficulties presented in the field of heart disease, however, feature selection with high accuracy and classification of them is a big challenge. In this study the authors tried to diminish the difficulties using combined methods.

3 Method

A data mining method is presented using a combination of imperialist competitive algorithm and K-nearest neighbor for the selection of feature and classification of heart disease data, which the combination of these two methods reaches to an optimized method. This process is shown in Fig. 1.

The proposed method is started with feeding input training data. Then the data are normalized. It is proven [14] that, in comparison with other methods, Z-score has been normalized heart data in the best way. Because it maintain range (minimum and maximum) and variance and standard deviation are introduced as a dispersion of the series by it. After that, ICA start to do feature extraction which is described section 3.1. It should be mentioned that, KNN are used as a fitness function inside ICA algorithm. Therefore ICA alone with KNN has made a combination method to classify the input data. There is no mathematic formula for the proposed method because it is combination of two already existing method and they are described in detail further.

The combination of these two methods in this study is more efficient in compare previous works. First, in order to perform the suggested method the data are loaded, and then imperialist competitive algorithm is used to select the appropriate features from the loaded data, which can classify them. In the following section it is completely explained that how imperialist competitive algorithm with meta-heuristic approach will operate.

3.1 Imperialist competitive algorithm for feature selection

Imperialist Competitive Algorithm (ICA) is used to select features in the diagnosis of heart disease. In this study, it is assumed that the number of features is specified, and the aim is just to find the best features that can increase the accuracy of diagnosis of heart disease. The number of the selectable features in the implemented tests is assumed to be equal to different datasets. Similar to other developmental optimization methods, imperialist competitive algorithm also begins with an initial population, which each member of the population is called a country. These countries are divided into two groups; the countries that are colony that are subordinated to a country, and colonialist countries, i.e. they are dominating some colonies. Each colonialist country dominates colonies based on their power, and finally the most powerful country is selected as the optimized point in optimization problems [15]. Figure 1 shows the proposed method.

Indeed, initialization is the determination of the initial population to start the optimization algorithm. Optimization problems mostly look for an optimized answer based on the problem’s variables. Also, in this algorithm an array of variables of the problem that will be optimized are created. The created array in the imperialist competitive algorithm is called a country. A country is in an array of 1 ∗ N_var. The magnitudes of the variables are shown in decimal form. The members of a country for heart disease in this algorithm can be considered as the features in this disease. A better understanding of this problem can be achieved from Fig. 2.

To design the features of heart disease with the highest accuracy in diagnosis, the obtained answers will have a stable output. For this problem a set of initial answers may be obtained, and it defines a country as follows:

$$ Countr{y}_i=\left[ Ag{e}_i, Gende{r}_i, Clostru{l}_i\right] $$

(1)

To use these countries as the initialization of the algorithm, some initial countries have to be created. Therefore, the matrix of all countries is randomly formed by random magnitudes. The cost of a country in this algorithm is calculated by the evaluation of function f in variables (p₁, p₂, …, p_n) as follows:

$$ Cos{t}_i= Fitness\left( Countr{y}_i\right)= Fitness\left({p}_1,{p}_2,\dots, {p}_n\right) $$

(2)

Indeed, the cost function in this study is the accuracy of the classification of heart disease data, which the aim is to maximize the accuracy of the classification. To calculate the costs of a country, each feature of the classifier data is assumed as test data and the educational data. Then the accuracy or the cost function is calculated using these data. The sum of the answers of ErrorValue = 1 − (count/n + Eppsilon);is calculated as the cost of the country.

The best country (the best features with the highest accuracy) is what this study was looking for. The algorithm used in this study made an initial dataset and classified them into the form of the empires, and colonialist countries applied assimilation policy on the colonies and imperialist competitive is applied among the empires in order to find the best country. To start the imperialist competitive algorithm for heart disease, N_countryinitial countries with random magnitudes were formed. The number of the best members of the population is shown byN_imp, and this member is selected as the emperor. The number of other countries (colonies) is shown byN_col. Each imperialist took some colonies based on their power. In this regard, after the calculation of the costs of all imperialists, the normalization cost of them was calculated using the following equation:

$$ {C}_n=\underset{i}{\max}\left\{{c}_i\right\}-{c}_n $$

(3)

Equation 3 shows the power and the cost of an emperor. c_n is the cost obtained for imperialist n. $ \underset{\mathrm{i}}{\max}\left\{{\mathrm{c}}_{\mathrm{i}}\right\} $ is the maximum cost among all imperialists. And c_n is the cost spent for normalization of this imperialist. The more imperialist costs will have the less normalization costs. Colonies are divided among imperialists using the following equation:

$$ {P}_n=\left|\frac{C_n}{\sum_{i=1}^{N_{imp}}{C}_i}\right| $$

(4)

$ {\mathrm{P}}_{\mathrm{n}}=\mid \frac{{\mathrm{C}}_{\mathrm{n}}}{\sum \limits_{\mathrm{i}=1}^{{\mathrm{N}}_{\mathrm{i}\mathrm{mp}}}{\mathrm{C}}_{\mathrm{i}}}\mid $In the other words, the ratio of the colonies of a normalized power is an imperialist that is directed by that imperialist. Therefore, the initial number of the colonies for an imperialist is calculated using the following equation:

$$ N.C{.}_n=\left|\mathrm{Round}\left\{{P}_n.\left({N}_{col}\right)\right\}\right| $$

(5)

In this equation, N. C._n is the initial number of the colonies, and N_col is the total number of the colonies. N. C._n is the number of the initial colonies that are selected randomly and are given to the imperialist n. The imperialist competitive algorithm starts after obtaining the initial form of all empires. The developmental process is in a cycle that will continue until meet a suspension condition. The power of the empires is calculated using the following equation:

$$ \mathrm{T}.{\mathrm{C}}_n=\mathrm{Cost}\left({\mathrm{imperialist}}_n\ \right)\ \mathrm{mean}\left\{\mathrm{C}\mathrm{ost}\left(\mathrm{colonies}\ {\mathrm{of}\ \mathrm{mpire}}_n\right)\right\} $$

(6)

During colonial competitive the weak empires lose their power gradually and then they are eliminated, and only one empire will be left that will rule the world and manage it. When this happens, the imperialist competitive algorithm stops as it reaches to the optimized point of the goal function. In this algorithm, sometimes the power of an imperialist becomes less than one of its colonies, the power of an empire is the set of the power of the imperialist plus some percentage of the total power of its colonies. The cost of an emperor is calculated using the following equation:

$$ T.C{.}_n=\left|\mathrm{Cost}\left({\mathrm{imperialist}}_n\right)+\partial mean\left\{ cost\left( colonies\ of\ empir{e}_n\right)\right\}\right| $$

(7)

In this equation, the total cost of empire n is shown by T. C._n and ξ is a positive magnitude between 0 and 1. If ξ is a small magnitude, the total cost of an empire goes toward its own costs. If ξ increases, the effect of the costs of other colonies becomes higher than the cost of the empire. Generally ξ = 0.05 leads to appropriate results. In this study ξ is equal to 0.1.

During the ongoing competition in the algorithm, if an empire cannot increase its power, it will go toward non-optimized answers and it will lose its colonies, therefore gradually will be eliminated from the competition. Then, other empires will take control of the eliminated empire’s colonies; hence, they become more powerful. Each optimization algorithm will end after its run. Imperialist competitive algorithm also follows the condition of convergence and ending the total number of repetition. This algorithm will achieve convergence to reach general optimization. This also happens when all of the empires are eliminated and turn into the colonies of a single empire. This single empire is called the only optimized answer.

Feature selection in this method is used for prediction and diagnosis of heart disease, prevention from implementation of difficult procedures with high side-effects for diagnosis of heart disease, such as coronary angiography, as well as decreasing the number of implementation of useless tests, and therefore decreases the cost. In the next section, it will be explained that how the selected features are used for classification.

3.2 K-nearest neighbor algorithm for classification

K-nearest neighbor algorithm is a learning algorithm with observer that is used by imperialist competitive algorithm to classify the selected features in this research. This algorithm is utilized for two objectives; to estimate the density function of learning data distribution, and to classify data based on the learning patterns. The second objective is used in this study.

4 Evaluation of the results

The dataset used in this study belongs to the UCI Machine Learning Repository and heart disease data of Tehran Shahid Rajaei hospital. Chosen UCI dataset included 303 samples and 13 features [7]. These features are the criteria for diagnosis of heart disease that have been determined by WHO. Dataset of Shahid Rajaei hospital included 303 random patients with chest pain who were referred to the hospital. Data are selected by random selection of people. This data includes 59 features. The results are divided into two datasets in order to ease the evaluation process. These datasets are explained in the following sections.

4.1 Performance measure

Accuracy, Sensitivity, and Specificity are used to make comparison the performance of algorithms, as the most common metrics for assessment in medical field. In Table 1, Confusion Matrix is shown to describe component of three criteria which requires Table 2 information.

Table 1 Confusion Matrix

New hybrid method for heart disease diagnosis utilizing optimization algorithm in feature selection

Abstract

Similar content being viewed by others

Improving Heart Disease Prediction Using Feature Selection Through Genetic Algorithm

A comparative analysis of meta-heuristic optimization algorithms for feature selection on ML-based classification of heart-related diseases

An intelligence method for heart disease prediction using integrated filter-evolutionary search based feature selection and optimized ensemble classifier

1 Introduction

2 Related works

3 Method

3.1 Imperialist competitive algorithm for feature selection

3.2 K-nearest neighbor algorithm for classification

4 Evaluation of the results

4.1 Performance measure

4.2 UCI dataset

4.3 Datasets of Shahid Rajaei hospital

4.4 Extraction of feature selection’s results

4.5 Assessment of available algorithm efficiency

5 Conclusion

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Ethical approval

Additional information

Publisher’s note

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation