Introduction

Stroke, or cerebrovascular accident (CVA), is a condition in which part of the brain abruptly loses its supply of nutrients, oxygen, and glucose, which are normally delivered to it through the vascular system [11]. Lubis et al. [20] explained that there are two types of stroke, ischemic and hemorrhagic. An ischemic stroke (ISC) is one in which a solid blood clot blocks the flow of blood in an artery to the brain, whereas a hemorrhagic stroke (HEM) is one in which a blood vessel bursts and the escaping blood creates pressure in the brain [20]. In addition, a brief stroke-like attack whose symptoms resolve within 24 hours is called a transient ischemic attack (TIA). Stroke has long ranked among the deadliest diseases worldwide, third after heart disease and cancer [19]. Park and Ovbiagele [21] found that people with a mild disability after a stroke are at risk of a more severe stroke in the future.

According to data collected by the World Health Organization (WHO), more than 15 million people suffer a stroke each year [29]. Of these, 5 million die and another 5 million are left with a long-term disability. If the condition is not addressed and treated properly, the number of deaths is predicted to rise to 23.3 million by 2030 [17]. One of the challenges in diagnosing stroke is the lack of useful tools for analysing the data, which affects how stakeholders make decisions based on the information obtained from patient records.

To help health practitioners at different levels of treatment extract useful knowledge from large volumes of patient data, data mining tools are clearly required. Such tools can unveil previously unknown patterns and improve the analysis of large patient datasets, yielding information useful for diagnosis and for allocating suitable treatment. Many studies have applied data mining techniques to health data [8, 9, 25]. The research in [25] proposed a hybrid data mining approach to reliably diagnose heart disease and choose the best treatment for the patient. The authors of [9] implemented three data mining algorithms, CART (Classification and Regression Tree), ID3 (Iterative Dichotomiser 3), and DT (Decision Table), derived from decision tree or rule-based classifiers, to create prediction models on a heart disease dataset; they also proposed a new hybrid data mining model. Finally, [8] used classification techniques based on principal component analysis (PCA) and linear discriminant analysis (LDA) to detect lung cancer. As this survey shows, one of the main uses of data mining in health care is disease diagnosis via classification techniques. In this study, we use an Intuitionistic Fuzzy based Decision Tree for stroke disease in the context of health analytics.

Computational challenges arise when abundant data and knowledge result from the practical experience of health workers and professionals, leading them to deploy intuition in health diagnosis and treatment. This relates to an approach called the intuitionistic fuzzy set, first proposed by Atanassov [2]. It is one of the most effective and efficient ways of representing linguistic statements or arguments, because it computes a hesitation degree alongside the membership and non-membership functions. The benefits of intuitionistic fuzzy sets include capturing the intuition of experts or practitioners based on their experience, avoiding bias, and helping to diagnose disease [27]. Psychologically, intuition is able to process a large amount of information quickly without apparent cognitive effort [14, 28].

Furthermore, the hesitation margin of the intuitionistic fuzzy set is important when computing the entropy of the decision tree to be built [7]. An intuitionistic fuzzy based decision tree is applied to stroke data to classify the type of stroke a patient has. Das [10] explained that analytics is the process of transforming raw data into actionable strategic knowledge in order to gain insight into business processes and thereby guide decision-makers to help businesses run efficiently. Business analytics extends this aim: it identifies trends and interprets information by exploring historical data from multiple sources using statistical analysis, data mining, and other techniques, enabling managers to discover previously hidden patterns [4]. There are six stages of business analytics: data quality analytics, descriptive analytics, diagnostic analytics, predictive analytics, prescriptive analytics, and semantic analytics [10]. Additionally, Raghupathi and Raghupathi [23] proposed a specific form of analytics called health analytics.

Health analytics is the transformation of medical data, statistical analysis, and computer-based models into actionable strategic knowledge that assists health service providers in better understanding patients and making sounder decisions based on the available facts [23]. Health analytics has been used by healthcare professionals to identify areas of deficiency and suggest potential improvements in the emergency room (ER) [16]. That study applied solutions and monitored two indicators of ER effectiveness: ER length of stay (LOS) and the percentage of patients who leave without being seen. Other research implemented predictive analytics modeling to generate probability estimates for the Veterans Health Administration (VHA). Predictive analytics was used in the VHA to estimate parameters of a patient's history that provide insight into past behavior affecting future action, which is essential for clinical planning and scheduling decisions that improve patient care [13]. As this prior research shows, health analytics is an effective tool for improving the quality and performance of patient services. However, those studies [13, 16] did not address specific diseases such as stroke. This research performs stroke classification using an Intuitionistic Fuzzy Decision Tree within diagnostic analytics.

This study offers two potential contributions. The first is the decision tree model itself, built from stroke patients' medical data using the intuitionistic fuzzy based decision tree technique; the resulting set of influential variables provides valuable information for further, deeper studies of stroke diagnosis and treatment, or of related diseases. The second, in line with the idea of diagnostic analytics, is a set of scorecards based on the influential stroke variables extracted from the rules.

The rest of this study is organized as follows. Section 2 introduces the Intuitionistic Fuzzy based decision tree used to classify stroke, followed in Sect. 3 by its implementation for stroke diagnosis. Section 4 presents the experimental results. Finally, conclusions are given in Sect. 5.

Intuitionistic fuzzy sets (IFS)

The fundamentals of the fuzzy concept were introduced in 1965 [32]. Yager and Zadeh [31] explained that fuzzy logic underlies reasoning by way of approximate rather than exact estimates; they proposed a mathematical form for expressing the vagueness found in human language, such as far, near, big, or small. An intuitionistic fuzzy set consists of a membership and a non-membership function for each data attribute i [2]. Intuitionistic fuzzy sets represent information more richly and flexibly than ordinary fuzzy sets when uncertainty such as hesitancy is present. Fuzzy logic is also a fast way to map an input space onto an output space using degrees of membership. A fuzzy set in X is given as follows

$$A^{\prime} = \left\{ \left\langle x, \mu_{A^{\prime}}(x) \right\rangle \mid x \in X \right\}$$
(1)

where \(\mu_{{A^{\prime}}} \left( x \right) \in \left[ {0,1} \right]\) is the membership function of the fuzzy set A′. Atanassov [2] proposed the Intuitionistic Fuzzy Set (IFS) in 1983. An IFS is a generalization of fuzzy set theory characterized by a membership degree and a non-membership degree. It also takes into account a hesitation degree, defined as one minus the degree of membership and the degree of non-membership. Formally, let a (crisp) set X be fixed. An IFS A in X is an object of the following form

$$A = \left\{ \left\langle x, \mu_{A}(x), v_{A}(x) \right\rangle \mid x \in X \right\}$$
(2)

where, \(\mu_{A} : X \to \left[ {0,1} \right]\) and \(v_{A} : X \to \left[ {0,1} \right]\), such that

$$0 \le \mu_{A}(x) + v_{A}(x) \le 1$$
(3)

The hesitation margin of each element is then defined as

$$\pi_{A} ( x ) = 1 - \mu_{A} ( x ) - v_{A} ( x )$$
(4)

where \(\uppi_{\text{A}} ( {\text{x)}}\) is the degree of hesitation; for an ordinary fuzzy set, \(v_{A}(x) = 1 - \mu_{A}(x)\) and hence \(\pi_{A}(x) = 0\). The hesitation margin becomes important when entropy [26] is considered.
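The relationship among membership, non-membership, and hesitation in Eqs. (2)–(4) can be illustrated with a short sketch (an illustration only; the function name is ours, not from the cited works):

```python
def hesitation(mu, v):
    """Hesitation margin pi = 1 - mu - v (Eq. 4), subject to Eq. 3."""
    if mu + v < 0 or mu + v > 1.0 + 1e-12:   # constraint of Eq. 3
        raise ValueError("mu + v must lie in [0, 1]")
    return 1.0 - mu - v

# An ordinary fuzzy set is the special case v = 1 - mu, so pi = 0:
print(hesitation(0.6, 0.4))  # -> 0.0 (ordinary fuzzy element)
print(hesitation(0.6, 0.3))  # -> ~0.1 (genuine hesitation)
```

Any element with \(\mu + v < 1\) thus carries a positive hesitation margin, which is exactly the quantity the entropy computation below exploits.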

Intuitionistic fuzzy based decision tree

A decision tree consists of a root node, internal nodes, and leaf nodes. The algorithm starts with a training set of tuples and their associated class labels [12], and chooses which attribute becomes the root or an internal node by using entropy and information gain, determined as follows [6]:

$$E = -\sum_{i=1}^{K} p_{i} \log_{2} p_{i}$$
(5)
$$E_{Conditional} = \sum_{j=1}^{M} \frac{|E_{j}|}{|E|} \times E_{j}$$
(6)
$${\text{Information Gain}} = {\text{E}} - {\text{E}}_{\text{Conditional}}$$
(7)

Here E is the entropy of the training set, and \(p_i\) is the non-zero probability that an arbitrary tuple in E belongs to class \(C_i\).
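Equations (5)–(7) can be computed directly from class counts; the following is a minimal sketch (the function names are ours):

```python
from math import log2

def entropy(class_counts):
    """E = -sum(p_i * log2(p_i)) over the non-zero class probabilities (Eq. 5)."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, partitions):
    """Eq. 7: parent entropy minus the size-weighted entropy of each
    partition E_j produced by splitting on an attribute (Eq. 6)."""
    total = sum(parent_counts)
    e_cond = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - e_cond

# A perfectly separating split recovers all of the parent entropy:
print(information_gain([5, 5], [[5, 0], [0, 5]]))  # -> 1.0
```

The attribute with the largest information gain is the one chosen to split a node.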

The intuitionistic fuzzy decision tree presented here uses intuitionistic fuzzy entropy to obtain the entropy and information gain. The intuitionistic fuzzy entropy, denoted \(E_{IFS}\), is determined as follows [26]:

$$E_{IFS}(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{\min\left\{ H_{ifs}(x_{i}, M), H_{ifs}(x_{i}, N) \right\}}{\max\left\{ H_{ifs}(x_{i}, M), H_{ifs}(x_{i}, N) \right\}}$$
(8)

where M and N are the intuitionistic fuzzy elements \(\left( \langle \upmu , {\text{v}},\;\varvec{\pi} \rangle \right)\) representing full membership (M) and full non-membership (N), and \({\text{H}}_{\text{ifs}}\) is the normalized Hamming distance over membership, non-membership, and hesitation [26]:

$$H_{ifs}(x, M) = \frac{1}{2}\left( \left| \mu_{x} - 1 \right| + \left| v_{x} - 0 \right| + \left| \pi_{x} - 0 \right| \right)$$
$$H_{ifs}(x, N) = \frac{1}{2}\left( \left| \mu_{x} - 0 \right| + \left| v_{x} - 1 \right| + \left| \pi_{x} - 0 \right| \right)$$
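The two Hamming distances and the entropy of Eq. (8) can be sketched as follows (illustrative only; function names are ours):

```python
def h_ifs_m(mu, v, pi):
    """Normalized Hamming distance to full membership M = <1, 0, 0>."""
    return 0.5 * (abs(mu - 1) + abs(v - 0) + abs(pi - 0))

def h_ifs_n(mu, v, pi):
    """Normalized Hamming distance to full non-membership N = <0, 1, 0>."""
    return 0.5 * (abs(mu - 0) + abs(v - 1) + abs(pi - 0))

def e_ifs(elements):
    """Eq. 8: average over the sample of min/max of the two distances."""
    ratios = []
    for mu, v, pi in elements:
        m, n = h_ifs_m(mu, v, pi), h_ifs_n(mu, v, pi)
        ratios.append(min(m, n) / max(m, n))
    return sum(ratios) / len(ratios)

print(e_ifs([(0.25, 0.75, 0.25)]))  # -> ~0.4286, i.e. 0.375 / 0.875
```

An element close to M or N yields a ratio near 0 (low uncertainty), while an element equidistant from both yields a ratio of 1 (maximal uncertainty).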

Algorithm 1: Intuitionistic fuzzy based decision tree algorithm:

  1. Create the root node for the tree, holding the set of intuitionistic fuzzy data.

  2. If a node t with an intuitionistic fuzzy set of data D satisfies any of the following conditions, it is a leaf node and is assigned a class name:

    • The proportion of a class Ck is greater than or equal to Ɵr, \(\frac{{\left| {D^{ci} } \right|}}{\left| D \right|} \ge\uptheta_{r}\).

    • The number of data sets is less than Ɵn.

    • There are no attributes left for further classification.

    Then stop building the tree and prune it.

  3. If node D does not satisfy the above conditions, it is not a leaf node, and new sub-nodes are generated as follows:

    • Count the frequency of classes.

    • Derive the IFS.

    • Calculate the intuitionistic entropy.

    • Choose the attribute with the greatest information gain.

    • Derive a child node for each branch.

    • Replace D by Dj (j = 1, 2, …, m) and repeat from step 2 recursively.
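The control flow of Algorithm 1 can be sketched as a short program. This is our own simplified reconstruction, not the authors' implementation: for brevity it uses crisp attribute values and plain entropy in place of the intuitionistic fuzzy entropy of Eq. (8), and the threshold values are assumed; the leaf tests, best-gain split, and recursion mirror steps 1–3.

```python
from math import log2

THETA_R = 0.9   # class-proportion threshold theta_r (assumed value)
THETA_N = 4     # minimum node size theta_n (assumed value)

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def majority(labels):
    return max(set(labels), key=labels.count)

def info_gain(records, attr):
    """Gain of splitting `records` (list of (row, label)) on `attr`."""
    labels = [lbl for _, lbl in records]
    total, cond = len(records), 0.0
    for value in {row[attr] for row, _ in records}:
        subset = [lbl for row, lbl in records if row[attr] == value]
        cond += len(subset) / total * entropy(subset)
    return entropy(labels) - cond

def build_tree(records, attributes):
    labels = [lbl for _, lbl in records]
    # Step 2: leaf conditions (dominant class, too few samples, no attributes).
    if (not attributes
            or labels.count(majority(labels)) / len(labels) >= THETA_R
            or len(records) < THETA_N):
        return {"leaf": majority(labels)}
    # Step 3: split on the attribute with the greatest information gain.
    best = max(attributes, key=lambda a: info_gain(records, a))
    node = {"attr": best, "children": {}}
    for value in {row[best] for row, _ in records}:
        subset = [(r, l) for r, l in records if r[best] == value]
        node["children"][value] = build_tree(
            subset, [a for a in attributes if a != best])
    return node

# Toy example (hypothetical data): the label follows the "smoker" attribute.
rows = [({"age": a, "smoker": s}, "ISC" if s == "y" else "TIA")
        for a in ("young", "old") for s in ("y", "n") for _ in range(2)]
tree = build_tree(rows, ["age", "smoker"])
print(tree["attr"])  # smoker
```

In the full method, `info_gain` would be computed from \(E_{IFS}\) over the intuitionistic fuzzy memberships rather than from crisp counts.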

Implementation of intuitionistic fuzzy based decision tree for stroke diagnosis

Data description

In this study, we utilized data from BioMed Central [24], consisting of 19,436 patient records with 114 medical-record variables; 85% of the patients had an ischemic stroke. After discussion with experts and data preprocessing (handling missing values to improve data quality), 37 variables and 18,425 patient records were ultimately used. Early stroke symptoms are assessed through three groups of factors, namely general identification, drug type and drug condition, and patient condition, each containing a number of variables as presented in Table 1. These attributes are all important factors affecting individual health with respect to stroke.

Table 1 Sample set fields

Furthermore, the dataset was divided into training and test data using 10-fold cross-validation. The data were partitioned randomly into 10 sections, and 10 experiments were run sequentially, with each fold serving in turn as the testing data and the remaining folds as the training data. Cross-validation validates the accuracy of the classification model [30]. According to [15, 18, 33], extensive tests with different datasets and learning techniques have shown that 10 is the right number of folds for obtaining the best error estimates.
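The fold rotation described above can be sketched in a few lines (illustrative only; the function name and the fixed seed are ours):

```python
import random

def ten_fold_indices(n_samples, seed=0):
    """Shuffle indices into 10 roughly equal folds; each fold serves once
    as the test set while the other nine form the training set."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        yield train, test

# Every sample appears in a test set exactly once across the 10 runs:
splits = list(ten_fold_indices(100))
print(len(splits))                                           # 10
print(sorted(i for _, t in splits for i in t) == list(range(100)))  # True
```

Averaging the evaluation metric over the 10 runs then gives the cross-validated estimate.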

Rules from tree

According to Algorithm 1, the steps are to create the root node, split it into branch nodes, and obtain the leaf nodes.

  1. We convert the data into IFS membership function form. For instance, the age of a patient is expressed as an IFS membership function as shown in Fig. 1.

    Fig. 1
    figure 1

    Membership function of patient age

$$\mu \,Young\,Adults\left( x \right) = \left\{ {\begin{array}{*{20}ll} {0 ; \quad x < 16\ {\text{or}}\ x > 40} \\ {\frac{x - 16}{28 - 16} ;\quad 16 \le x < 28} \\ {\frac{40 - x}{40 - 28} ;\quad 28 \le x \le 40} \\ \end{array} } \right.$$
(9)
$$v \,Young\,Adults\left( x \right) = \left\{ {\begin{array}{*{20}c} {\frac{20 - x}{20 - 0} ; \quad x \le 20} \\ {\frac{x - 36}{40 - 36} ; \quad 36 \le x < 40} \\ \end{array} } \right.$$
(10)
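The triangular membership of Eq. (9) can be sketched as follows (a minimal illustration of Eq. 9 only; the function name is ours):

```python
def mu_young_adults(x):
    """Triangular membership of Eq. 9: zero outside [16, 40], rising to a
    peak at age 28 and falling back to zero at 40."""
    if x < 16 or x > 40:
        return 0.0
    if x < 28:
        return (x - 16) / (28 - 16)
    return (40 - x) / (40 - 28)

print(mu_young_adults(28))  # 1.0 (peak)
print(mu_young_adults(22))  # 0.5
```

The non-membership of Eq. (10) and the hesitation margin would be defined analogously for each linguistic age category.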
  2. After converting the data into IFS form, we split it into training and testing sets. The training data are used to build the decision tree, and the testing data are used to evaluate it. We then calculate the entropies that drive tree construction, beginning with the entropy of the TARGET class.

Step 1: Calculate the information entropy of the training set S according to Eq. 5. The training set contains 14,750 samples, so \(n =\) 14,750. These samples fall into four classes: ISC, HEM, TIA, and Nonstroke. The entropy is then:

$$\begin{aligned} E\left( S \right) = & \left( { - \frac{13{,}371}{14{,}750}} \right) \times \log_{2} \left( {\frac{13{,}371}{14{,}750}} \right) + \left( { - \frac{472}{14{,}750}} \right) \times \log_{2} \left( {\frac{472}{14{,}750}} \right) \\ & + \left( { - \frac{620}{14{,}750}} \right) \times \log_{2} \left( {\frac{620}{14{,}750}} \right) + \left( { - \frac{341}{14{,}750}} \right) \times \log_{2} \left( {\frac{341}{14{,}750}} \right) = 0.609 \\ \end{aligned}$$

Step 2: Calculate the intuitionistic fuzzy value using the intuitionistic fuzzy entropy of Eq. 8. As an example we use the age variable with the value 57, for which the membership is 0.25, the non-membership is 0.75, and the hesitation is 0.25. The intuitionistic fuzzy entropy is then computed as follows:

$$\begin{aligned} H_{ifs}(57, M) & = \frac{1}{2}\left( \left| \mu_{x} - 1 \right| + \left| v_{x} - 0 \right| + \left| \pi_{x} - 0 \right| \right) \\ & = \frac{1}{2}\left( \left| 0.25 - 1 \right| + \left| 0.75 - 0 \right| + \left| 0.25 - 0 \right| \right) \\ & = 0.875 \\ \end{aligned}$$
$$\begin{aligned} H_{ifs}(57, N) & = \frac{1}{2}\left( \left| \mu_{x} - 0 \right| + \left| v_{x} - 1 \right| + \left| \pi_{x} - 0 \right| \right) \\ & = \frac{1}{2}\left( \left| 0.25 - 0 \right| + \left| 0.75 - 1 \right| + \left| 0.25 - 0 \right| \right) \\ & = 0.375 \\ \end{aligned}$$
$$\begin{aligned} E_{IFS}(57) & = \frac{0.375}{0.875} = 0.428 \\ \end{aligned}$$
$$\begin{aligned} E_{IFS}(Age) & = \frac{1}{14{,}750} \times \left( 0.428 + \cdots \right) \quad (n = 14{,}750\ \text{samples}) \\ & = 0.56 \\ \end{aligned}$$

Step 3: Calculate the information gain according to Eq. 7, with \({\text{E}}_{\text{Conditional}}\) replaced by \({\text{E}}_{\text{IFS}}\):

$$\begin{aligned} Information \,Gain\, & = \,{\text{E}}\left( {\text{S}} \right) - {\text{E}}_{\text{IFS}} \left( {\text{Age}} \right) \\ & = 0.609 - 0.56 \\ & = 0.049 \\ \end{aligned}$$
  3. Entropy is then calculated for all remaining attributes against the TARGET entropy as in Eq. 6, after which the information gain is computed according to Eq. 7. These steps are repeated until no attributes are left. If the resulting tree is large, it should be pruned: pruning reduces the size of the tree by eliminating branches with little power to classify the TARGET, lowering the complexity of the decision tree while keeping its diagnostic accuracy high. The technique used here is error-based pruning (EBP), as in the J48 algorithm. EBP prunes internal nodes from the bottom of the decision tree using Eq. 11 [22].

    $$\acute{e}(t) \le \acute{e}\left( T_{t} \right) + std\left( \acute{e}\left( T_{t} \right) \right)$$
    (11)

    where ẻ\(\left( t \right)\) is the error rate if the internal node t is collapsed to a leaf, ẻ\(\left( {T_{t} } \right)\) is the error rate of the subtree with internal node \(\left( t \right)\), and std(\(( T_{t} )\)) is the standard error of the subtree's error rate, given in detail by Eq. 12.

    $$std\left( \acute{e}\left( T_{t} \right) \right) = \sqrt{\frac{\acute{e}\left( T_{t} \right)\left( 1 - \acute{e}\left( T_{t} \right) \right)}{n(t)}}$$
    (12)

    where \({\text{n}}\left( {\text{t}} \right)\) is the number of nodes checked for error rate.
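The pruning test can be sketched as follows. This is our reconstruction of a common error-based-pruning criterion (prune when the collapsed node's error does not exceed the subtree's error by more than one standard error); the exact form used by the authors may differ.

```python
from math import sqrt

def std_error(err, n):
    """Standard error of an error-rate estimate over n cases:
    sqrt(e * (1 - e) / n), the pattern of Eq. 12."""
    return sqrt(err * (1 - err) / n)

def should_prune(err_node, err_subtree, n_subtree):
    """Prune when collapsing the subtree into a single leaf does not raise
    the error beyond one standard error of the subtree's estimate."""
    return err_node <= err_subtree + std_error(err_subtree, n_subtree)

print(std_error(0.2, 100))           # ~0.04
print(should_prune(0.23, 0.2, 100))  # True  (0.23 <= 0.24)
print(should_prune(0.30, 0.2, 100))  # False (0.30 >  0.24)
```

Applying the test bottom-up replaces subtrees whose extra structure does not buy a statistically meaningful reduction in error.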

Evaluation of the model

Performance evaluation is conducted on the model to determine how well it performs on test data, based on accuracy, sensitivity, and specificity. Accuracy is the closeness of the measurements to the true value; sensitivity is the fraction of actual positives that are correctly identified; and specificity is the fraction of actual negatives that are correctly identified. Accuracy, sensitivity, and specificity are calculated as shown in Eqs. 13, 14, and 15, based on Table 2 [5].

$${\text{Accuracy }} = \frac{\text{True Positive + True Negative}}{\text{True Positive + False Positive +True Negative+False Negative}}$$
(13)
$${\text{Sensitivity}} = \frac{\text{True Positive}}{\text{True Positive + False Negative}}$$
(14)
$${\text{Specificity}} = \frac{\text{True Negative}}{\text{True Negative + False Positive}}$$
(15)
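Equations 13–15 follow directly from the confusion matrix counts, as in this minimal sketch (the function name is ours):

```python
def evaluate(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from a binary confusion
    matrix (Eqs. 13-15; specificity uses TN / (TN + FP))."""
    accuracy    = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration:
print(evaluate(tp=80, fp=10, tn=90, fn=20))  # (0.85, 0.8, 0.9)
```

For the multi-class stroke model, these quantities are computed per class (one class against the rest) and then summarized.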
Table 2 Confusion matrix

Diagnosis analytics

Diagnosis analytics is used to assess why a particular result occurred, based on existing data. At this level of analytics, the context of the data is examined, and the factors that may have contributed to the results are evaluated [3]. To examine the data and show cause–effect relationships among the stroke variables, we used a scorecard. The scorecard displays record summaries of the data, shows the cause–effect relationships among stroke variables, and is updated periodically. Diagnosis analytics helps paramedics identify the causes behind the observed performance, including the impact of stroke input factors and operational policies on the performance measures. The input for the diagnosis analytics is the information gain produced by the Intuitionistic Fuzzy based Decision Tree.

Result and discussion

After applying the Intuitionistic Fuzzy based Decision Tree, the attributes that influence classification form a tree, shown in Fig. 2. Of the 40 original attributes, only 25 have a significant influence according to the calculated information gain. These attributes determine the shape of the decision tree and the classification of stroke diagnoses in this study.

Arslan et al. [1] addressed classification of a specific stroke type (ischemic) using data mining techniques, namely Support Vector Machines (SVM), Stochastic Gradient Boosting (SGB), and penalized logistic regression, with different groups of variables. Several of their most influential variables coincide with the results of this study, including age. Their study, however, relies on elements of medical check-up analysis (e.g., white blood cell count, hematocrit, hemoglobin, platelets).

Using the same data, we compared decision-tree-based algorithms, namely J48 and C5.0, to assess the accuracy of this study. Table 3 shows that the differences in accuracy are not significant; because the tree-building techniques are very similar, the accuracies obtained differ little. More pronounced differences would have been preferable as a basis for further analysis, such as an initial diagnosis of the patient before further action is taken, so future work could apply more specific variables to increase the significance of the results. A similar outcome was reported in prior work: [7] examined the glass dataset and found that Random Forest, J48, the Soft Decision Tree, and the Intuitionistic Fuzzy Decision Tree itself achieved accuracies that did not differ significantly.

Fig. 2
figure 2

Tree formation

Table 3 The comparison of IFS decision tree and other classifiers

What differentiates the algorithms is how the decision tree model is built from the attributes. J48 and C5.0 produce almost the same rules; the strength of the Intuitionistic Fuzzy based Decision Tree is that the hesitation degree is taken into account when forming the rules. The resulting tree is shown in Fig. 2. Selecting the same node, all classifiers indicated the ischemic stroke type, but the resulting rules differed: the rule from the Intuitionistic Fuzzy based decision tree carries a hesitation degree on the age variable, which makes the rule more specific in classifying the type of stroke (Table 4).

Table 4 The comparison of generated rules using J48, C5.0 and IFS based decision tree
Fig. 3
figure 3

Classification model performance

The classification results are evaluated using several measures: accuracy, sensitivity, and specificity. Table 5 is the confusion matrix for the stroke classification model.

Table 5 Confusion matrix of tree construction

Model evaluation on the testing and training data indicates that the resulting classification model for the type of stroke is quite good. The AUC value of the model shows a high performance level (AUC = 0.73), as shown in Fig. 3. Based on the information gain of the Intuitionistic Fuzzy based Decision Tree, the scorecard is shown in Fig. 4, which illustrates the variables influencing stroke type in the model and helps in understanding the underlying causes of stroke disease.

Fig. 4
figure 4

Diagnosis analytics based on type of stroke

Conclusion

This paper contributes a formulation of the Intuitionistic Fuzzy based Decision Tree for classifying types of stroke in BioMed Central data. The Intuitionistic Fuzzy based Decision Tree serves as the core of diagnosis analytics, providing a classification model that supports physicians and paramedics in determining future actions for patients. The model accommodates the intuition of doctors and paramedics and provides a crisp solution for particular cases. The proposed diagnosis analytics proves an effective and efficient way of representing statements or arguments, using the hesitation degree to support deeper interpretation alongside the membership and non-membership functions. The approach showed performance comparable to other similar decision tree models while offering richer interpretability.