
1 Introduction

In the age of computer technology, there is no shortage of data for analysis. The main problem is to decide which research methods to apply, what insights can be drawn from the data, and what decisions those insights can support.

Typically, the analysis of data begins with the application of classical methods of descriptive statistics and data visualisation, which can help discover patterns or show trends in the data. One of the initial data analysis steps is to determine measures of central tendency (mean, median, mode) and measures of variability (standard deviation, range, variance, skewness, kurtosis). These characteristics provide a better understanding of the nature of the research object and give an initial picture of the data: their layout, quality and completeness.

These characteristics of data can be easily obtained using a variety of computer programs, from the widely used MS Excel to specialised statistical environments such as SPSS, STATISTICA, Matlab or Google BigQuery. For solving more advanced research problems, we will use different software and computer tools, allowing the reader to consider the most appropriate solution for a specific artificial intelligence problem.

The discussion and comparative evaluation of artificial intelligence approaches, and the illustration of their performance by applying different AI methods and tools, should help reveal the advantages of artificial intelligence and machine learning methods in health data analysis from different perspectives. This research topic is very popular and attracts the attention of many researchers [1,2,3,4,5,6,7]. For this purpose, we will take a large real clinical record file and analyse it using various research methods.

The database used for the experimental research is a collection of registered stroke cases from the neurology department of the Clinical Centre in Montenegro. The database consists of structured records of 944 different patients and 58 variables, 50 of which are coded by the scale values 1, 8 and 9, corresponding to “Yes”, “No” and “Unspecified” conditions, and 8 of which contain demographic data, the admission date and the discharge date from hospital. The data were collected between 02/25/2017 and 12/18/2019. The demographic data of the stroke patients vary by age (from 13 to 96 years) and gender (485 male, 427 female).

Further, we will introduce several data research methods that let us examine the structure of the data, find important patterns and disclose the relationships among the most important variables. We will not only present various research methods, but also explain how to clean and transform the original data according to the task requirements, and how to use different software tools for specific artificial intelligence and machine learning methods.

The next section will focus on regression and correlation analysis and on assessing the strength of dependence in our data.

Sect. 10.3 will examine the application of logit and probit regression to predict the variable Vital_Status of a stroke patient from individual characteristics such as Type of stroke, Treatment methods, Health modified ranking score before stroke, Age at stroke and Gender. Here we will introduce the Google BigQuery Machine Learning capabilities to address this type of task.

Sect. 10.4 will describe the unsupervised machine learning method k-Means, which lets us partition data records into a predefined number of clusters. The calculations will be performed with the help of Matlab software.

Sect. 10.5 will explore the application of neural networks to a supervised classification task, using the STATISTICA Data Mining tools.

2 Correlation and Regression Analysis

In general, regression analysis is a statistical method that estimates the dependence between two or more quantitative variables in order to predict a dependent variable [10].

The simplest regression dependence is linear: \(y = \beta_0 + \beta_1 x\). The coefficients of the equation are found by the least squares method, i.e. by minimising the sum of squared differences between the points \((x_i ,y_i )\) and the regression line. Regression analysis methods are very widely used in medical research. Usually, MS Excel is sufficient to draw a regression line and calculate the determination or correlation coefficients, but here we will apply the STATISTICA software package and limit our analysis to an example of the simplest regression curve.

We will illustrate the task of finding the interdependence between the number of days spent in hospital and the age of the patients at stroke, which varied between 20 and 50 years; Table 10.1 shows a sample of the data set used.

Table 10.1 Example of the data records

First, we explore a scatterplot to visualise the data and find a linear regression equation (Fig. 10.1).

Fig. 10.1 Scatterplot for visualizing linear regression between Age and Days at Hospital

As seen from Fig. 10.1, the linear regression equation for the variables denoting age and days at hospital is \(Days\,At\,Hosp = - 4.9906 + 0.4102 \cdot Age\). It enables us to forecast the number of days to be spent at hospital from the age at stroke. The relevance of the results, and their suitability for forecasting, is judged by the coefficient of determination; in our case (Fig. 10.1) it equals \(r^2 = 0.0468\). The coefficient of determination is interpreted as the proportion of the variance of the dependent variable that is predictable from the independent variable, ranging from 0 (no dependence between variables) to 1 (a perfect fit). In the solved example only approximately 5% of the variation of the dependent variable Days At Hosp can be explained by the independent variable Age.
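To make the calculation concrete, the same kind of fit can be reproduced outside STATISTICA. The following sketch uses Python with numpy (an illustrative choice, not the tool used in the chapter); the data values are hypothetical stand-ins for the records of Table 10.1.

```python
import numpy as np

# Hypothetical values standing in for the records of Table 10.1
age = np.array([23, 28, 31, 35, 38, 42, 45, 47, 49, 50], dtype=float)
days_at_hosp = np.array([5, 7, 4, 9, 8, 12, 10, 15, 11, 14], dtype=float)

# Least-squares fit of Days_At_Hosp = b0 + b1 * Age
b1, b0 = np.polyfit(age, days_at_hosp, deg=1)

# Coefficient of determination r^2
predicted = b0 + b1 * age
ss_res = np.sum((days_at_hosp - predicted) ** 2)
ss_tot = np.sum((days_at_hosp - days_at_hosp.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"Days_At_Hosp = {b0:.4f} + {b1:.4f} * Age, r^2 = {r_squared:.4f}")
```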

If the relationship between the variables is not well fitted by a line (as in Fig. 10.1), we may use non-linear regression. Then, instead of a line, we explore parabolic, exponential or logarithmic equations and determine their unknown parameters by the least squares method. If the dependent variable is not well predicted by a single variable, several independent variables can be used to describe the situation more accurately; this type of regression is called multiple regression. Typically, multiple linear regression uses no more than 5 or 6 additional variables. Both multiple and curvilinear regression calculations can be performed with the software tools already mentioned.
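As a sketch of how curvilinear and multiple regression follow the same least-squares principle, the fragment below (Python with scikit-learn, on synthetic data with hypothetical variable names) fits a quadratic curve in one predictor and a linear model in two predictors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Hypothetical predictors: age and health score before stroke
age = rng.uniform(20, 50, size=100)
health_score = rng.integers(0, 7, size=100).astype(float)
days_at_hosp = 3 + 0.3 * age + 1.5 * health_score + rng.normal(0, 2, size=100)

# Curvilinear (quadratic) regression on age alone
quad = PolynomialFeatures(degree=2, include_bias=False)
X_quad = quad.fit_transform(age.reshape(-1, 1))
model_quad = LinearRegression().fit(X_quad, days_at_hosp)

# Multiple linear regression on two independent variables
X_multi = np.column_stack([age, health_score])
model_multi = LinearRegression().fit(X_multi, days_at_hosp)

print("quadratic coefficients:", model_quad.coef_, model_quad.intercept_)
print("multiple regression R^2:", model_multi.score(X_multi, days_at_hosp))
```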

3 Logit and Probit Models

Traditional regression methods sometimes have difficulty describing a dependent variable that takes values only from the range [0, 1], or the values 0/1 (true/false, success/failure, error/no error, etc.). In this case, logit or probit regression [11] is appropriate. The main difference between these two models is the link function: the logit model uses the cumulative distribution function of the logistic distribution, while the probit model invokes the cumulative distribution function of the standard normal distribution. Both functions may take any number as input and rescale it to fall within the range [0, 1]. These regression models are applied in medical and social science tasks and are widely used to solve marketing and financial problems.

To illustrate the performance of these models, we chose an example task: how the blood pressure condition (0 = normal, 1 = high) may depend on the age, weight, physical activity and stress level of the patient. Another similar example could be the evaluation of prostate enlargement (0 = enlarged, 1 = normal) from the available health indicators of the patient.

The basic assumptions of the logistic regression model are defined as follows [12]: suppose the dependent variable y takes the value 1 with probability p, and the value 0 with probability \(q = 1 - p\). The independent variables used for building a logistic regression model can be of any type, i.e. quantitative, qualitative or categorical, and their distribution is not restricted either.

In logistic regression, the relationship between the outcome variable and the descriptive variables is not a linear function, as it was in the case of linear regression. The logistic regression model relates the probability value p to the independent variables \(x_1 ,x_2 ,...,x_n\):

$$p = \frac{1}{{1 + e^{ - (a + b_1 x_1 + ... + b_n x_n )} }},$$

where \(a\) is a constant and \(b_i\) are the regression weights of the independent variables.

This equation takes another form after applying the logistic transformation function (logit) [12].

$$Logit\left( p \right) = \ln \left( {\frac{p}{1 - p}} \right) = a + b_1 x_1 + b_2 x_2 + \cdots + b_n x_n .$$

The link function of the logit regression is expressed by \(f(x) = \frac{1}{{1 + e^{ - x} }}\), and the link function of probit regression is the cumulative distribution function of the standard normal distribution, \(\Phi \left( x \right) = \int_{ - \infty }^x {\frac{1}{{\sqrt {2\pi } }}} \exp \left( { - \frac{y^2 }{2}} \right)dy\), which differs only slightly from the logistic one (Fig. 10.2).

Fig. 10.2 Logit and probit link or transformation functions

Thus, we can define the probit regression as follows:

$$p = \Phi (Z) = \int\limits_{ - \infty }^Z {\frac{1}{{\sqrt {2\pi } }}\exp \left( { - \frac{x^2 }{2}} \right)} dx,$$

where \(Z = a + b_1 x_1 + b_2 x_2 + \ldots + b_n x_n .\)

The logistic and probit regressions differ only in the transformation function, which determines the differences in the behaviour of these models. The normal distribution function grows faster than the logistic one, and therefore probit regression is more sensitive to the descriptive variables [13].
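The practical difference between the two links can be seen by fitting both models to the same data. The sketch below uses Python with statsmodels on synthetic data; it only illustrates the technique, not the chapter's BigQuery workflow, and the variable names are assumed.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(1)

# Synthetic example: probability of high blood pressure rising with age
age = rng.uniform(20, 80, size=500)
p_true = 1.0 / (1.0 + np.exp(-(-6.0 + 0.1 * age)))
high_bp = rng.binomial(1, p_true)

X = sm.add_constant(age)                 # adds the intercept column
logit_fit = sm.Logit(high_bp, X).fit(disp=False)
probit_fit = sm.Probit(high_bp, X).fit(disp=False)

# The two link functions applied to the same linear predictor values
z = np.array([-2.0, 0.0, 2.0])
print("logistic link :", 1.0 / (1.0 + np.exp(-z)))
print("probit link   :", norm.cdf(z))
print("logit params  :", logit_fit.params)
print("probit params :", probit_fit.params)
```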

The logit and probit regression models belong to the class of supervised machine learning techniques. It means that a training set with labelled examples is available for building the model. A supervised learning algorithm analyses the training data, learns the input/output regularities and produces an inference function, which can be used to estimate the output for new input examples.

We will solve the illustrative example of logit regression with the help of Google BigQuery ML (see [8]). BigQuery ML enables creating and executing machine learning models by using standard SQL queries and the ML libraries. BigQuery ML supports not only linear and logistic regression models, but also provides tools for k-Means clustering, matrix factorisation, time series, deep neural networks and other computational intelligence methods.

The 944 data records of patients diagnosed with stroke were used for estimating the logit and probit models. We explore “Vital status after hospitalisation” (1 = alive, 0 = not alive) as the dependent variable, which is possibly affected by 5 independent variables: (1) Type of stroke, (2) Treatment methods, (3) Health modified ranking score before stroke, (4) Age at stroke, and (5) Gender.

Table 10.2 presents an excerpt of the data records and the corresponding variables.

Table 10.2 Example of the data

The variable Stroke_Type takes the value 1 for the diagnosis of ischemic stroke, 2 for hemorrhagic stroke, 3 for SAH and 4 for unspecified stroke. The variable Vital_Status after hospitalisation may take the value 1 for alive or 0 for not alive. The variable Treatment_Methods denotes the categories of medications, or their combinations, received during the hospital stay; the corresponding values are given in Table 10.3.

Table 10.3 Codes for treatment methods

For example, the code value 13 means a combination of two types of medication, Anticoagulation and Thrombolysis, and 24 means Dual Antiplatelet Therapy combined with medications from the broad group Other. Health_Status is evaluated from 0 (good health) to 6 (very bad health), with 9 standing for unknown. The gender code 1 means male and 0 means female. The variable Data_Frame ensures the random distribution of the database records to the Training (T), Evaluation (E) and Prediction (P) sets.

The logit regression model can be processed in BigQuery ML: we need to open the Google Cloud platform and the BigQuery sandbox, set up a new project, create the dataset and upload the data file. Designing the logistic regression model consists of the following steps:

1. Create and train the logistic regression model on the training data.

2. Evaluate the model performance with the evaluation set of data.

3. Predict the output for the prediction data set.

For the model creation task, we can write a simple SQL query (Fig. 10.3).

Fig. 10.3 Model creation statements

Here ‘Logit.Logit’ is the name assigned to the uploaded table. The achieved performance of the logit regression in classifying Vital_Status can be seen in Fig. 10.4.

Fig. 10.4 Evaluation of trained logit model

In Fig. 10.4, the confusion matrix is presented as a table in which the predictions are given in columns and the actual status in rows. The performance of the model is explored by applying several evaluation characteristics: accuracy, recall and precision. Accuracy shows what percentage of all values is correctly predicted by the model; in our case the general accuracy of prediction is close to 70%. Recall calculates the percentage of correct predictions of Vital_Status among all the true values (= 1). It means that the performance of the model for the Vital_Status value “1” is better than the general performance and equals 76.83%. The proportion of instances that were correctly recognised as positive (out of all positive predictions) is called precision. The F1 score denotes the harmonic mean of precision and recall. The accuracy of the model may be satisfactory or insufficient depending on the requirements and complexity of the task being solved. Figure 10.4 shows the performance of the logit regression model on the training set.
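To make the relationship between these characteristics explicit, the short Python fragment below computes accuracy, recall, precision and the F1 score from a confusion matrix; the counts are purely hypothetical and serve only to illustrate the formulas.

```python
import numpy as np

# Hypothetical binary confusion matrix: rows = actual, columns = predicted
#                  predicted 0  predicted 1
conf = np.array([[ 90,          60],    # actual Vital_Status = 0
                 [ 45,         150]])   # actual Vital_Status = 1

tn, fp = conf[0]
fn, tp = conf[1]

accuracy = (tp + tn) / conf.sum()
recall = tp / (tp + fn)            # share of true "1" cases predicted correctly
precision = tp / (tp + fp)         # share of "1" predictions that are correct
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} recall={recall:.4f} "
      f"precision={precision:.4f} F1={f1:.4f}")
```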

To see the performance of our model on the evaluation set, we can write an evaluation SQL query (Fig. 10.5).

Fig. 10.5 SQL for model evaluation

When the SQL query is executed, BigQuery calculates the accuracy and other model performance characteristics on the evaluation set (Fig. 10.6).

Fig. 10.6 Evaluation results

As we can see, the logit regression model performs even better on the evaluation set than on the training set. The accuracy and the other ratio estimates show good classification power for the Vital_Status variable.

The last step of logit regression modelling applies the model to the prediction set. For this case we need to write a prediction SQL query (Fig. 10.7).

Fig. 10.7 SQL query for prediction set results

The execution of this query lets us find the predictions of Vital_Status and present the model application results in a table (Table 10.4).

Table 10.4 Prediction results for prediction set

Comparison of the column “Predicted_Vital_Status” with the original values “Vital_Status” in Table 10.4 shows that some of the predicted values differ from the original ones, but the overall accuracy calculated over all prediction set records equals 0.684426. This value lets us conclude that the method has good classification capabilities. The presentation of the outcomes in Table 10.4 also opens the possibility for further advanced analysis: expert examination of the incorrectly predicted cases may bring insights for adding more input variables, or for changing their coding, in order to reduce confusion among the predicted classes and increase the accuracy of the model.
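For readers who prefer to drive the same three steps programmatically rather than from the BigQuery console, the sketch below shows how they might look with the google-cloud-bigquery Python client. The project, dataset, table and column names are assumptions made for illustration; the exact statements used in the chapter are those shown in Figs. 10.3, 10.5 and 10.7.

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes a configured Google Cloud project

# 1. Create and train the logistic regression model on the training rows.
create_sql = """
CREATE OR REPLACE MODEL `my_dataset.vital_status_logit`
OPTIONS(model_type='logistic_reg', input_label_cols=['Vital_Status']) AS
SELECT Stroke_Type, Treatment_Methods, Health_Status, Age, Gender, Vital_Status
FROM `my_dataset.Logit`
WHERE Data_Frame = 'T'
"""
client.query(create_sql).result()

# 2. Evaluate the trained model on the evaluation rows.
eval_sql = """
SELECT * FROM ML.EVALUATE(MODEL `my_dataset.vital_status_logit`,
  (SELECT * FROM `my_dataset.Logit` WHERE Data_Frame = 'E'))
"""
for row in client.query(eval_sql).result():
    print(dict(row))

# 3. Predict Vital_Status for the prediction rows.
predict_sql = """
SELECT predicted_Vital_Status, Vital_Status
FROM ML.PREDICT(MODEL `my_dataset.vital_status_logit`,
  (SELECT * FROM `my_dataset.Logit` WHERE Data_Frame = 'P'))
"""
for row in client.query(predict_sql).result():
    print(dict(row))
```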

4 k-Means Clustering

Unlike logit and probit modelling, cluster analysis belongs to the class of unsupervised learning techniques, which find natural groupings and patterns in data without the need for a labelled data set for model training.

k-Means clustering is a data partitioning method that assigns records (or objects) to a predefined number of clusters. k-Means treats each observation as an object that has a location in a multidimensional space. The algorithm finds a partition in which objects within each cluster are as close to each other as possible and, at the same time, as far as possible from the objects of the other clusters. Based on the attributes of our data, we can select one of the generally applied distance metrics to be used by the k-Means model for calculating the distances between clusters and the distances between the instances within a cluster.

As k-Means clustering creates a single level of clusters, it is suitable both for large numbers of data objects and for numerous attributes. Each cluster in a k-Means partition consists of its member objects and has a centre, or centroid. The k-Means method tries to minimise the sum of the distances between the centroid and each member object of the cluster. The computation procedure depends on the applied distance metric; by default, k-Means uses the squared Euclidean distance. The visualisation of the output of the method plots the clusters in a two-dimensional space to simplify the analysis; however, the underlying computations deal with the multidimensional setting.

The following steps are performed for k-Means clustering [14]:

1. Examine k-Means clustering solutions for different selected numbers of clusters k to determine the optimal number of clusters for the data set. Some tools (such as STATISTICA or Viscovery SOMine) offer estimation of the optimal number of clusters;

2. Evaluate the clustering solutions by analysing silhouette plots and silhouette values, or on the basis of criteria such as the Davies–Bouldin and Calinski–Harabasz index values;

3. Replicate clustering from different randomly selected centroids and return the final solution with the lowest total sum of distances among all the replicates.

The silhouette value is a standard measure of how close the points of one cluster are to the points of the adjacent clusters. This measure takes values from the interval [−1, 1]: the value −1 denotes points that are probably assigned to the wrong cluster, and a silhouette value equal to 1 indicates points that are very distant from the neighbouring clusters. Usually silhouette values are presented graphically in a silhouette plot, which helps to choose the right number of clusters.

The criteria-based methods for finding the optimal number of clusters include the calculation of the Davies–Bouldin or Calinski–Harabasz index values. Without going into the technical details of calculating these indicators, we note that the optimal number of clusters is indicated by the lowest Davies–Bouldin value and the highest Calinski–Harabasz value.
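To illustrate how these criteria behave, the following Python sketch (using scikit-learn instead of the Matlab code discussed below) evaluates k-Means partitions for several values of k on synthetic data; with the real records, X would hold the selected patient attributes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic stand-in for the patient attribute matrix (3 natural groups)
X, _ = make_blobs(n_samples=440, centers=3, n_features=7, random_state=0)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    labels = km.labels_
    print(f"k={k}  silhouette={silhouette_score(X, labels):.3f}  "
          f"Calinski-Harabasz={calinski_harabasz_score(X, labels):.1f}  "
          f"Davies-Bouldin={davies_bouldin_score(X, labels):.3f}")

# The preferred k maximises the silhouette and Calinski-Harabasz values
# and minimises the Davies-Bouldin value.
```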

To illustrate the application of the k-Means clustering method, we used an extended version of the previously described data file containing various health and personal characteristics of patients diagnosed with stroke. Some attributes of this file were explained in the previous sections, such as Vital_Status, Type of stroke, Treatment methods, Health modified ranking score before stroke, Age at stroke, Gender, and Days spent in hospital.

For the further study, the data set was expanded with variables expressing other characteristics of the patient history, under the assumption that additional knowledge about the patient may increase the prediction power of the model. Information on whether there was a stroke before, the specific stroke symptoms, and the indication of health complications may enable us to better distribute the patients into meaningful groups and recognise useful patterns in the data. An example of the data file records used for k-Means clustering is presented in Table 10.5. According to the concept of the multidimensional space associated with clustering models, we can imagine the file records as points in a 9-dimensional space. Contrary to the supervised methods, all variables serve as inputs.

Table 10.5 Example data file records for clustering

For this example, we filtered only the records of patients with Vital_Status = 1 (alive); therefore 642 records were used for the k-Means clustering research. The clustering of patients into a predefined number of clusters can be useful when meaningful categories (clusters) are applied in medical practice, such as separating patients for rehabilitation, medication prescription or the appointment of specific health strengthening procedures.

In Table 10.5, a Past_Stroke value equal to 1 means a repeated stroke case. Stroke_Symptoms can take the values 0 (no symptoms), 1 (impaired consciousness), 2 (weakness/paresis), 3 (speech disorder, aphasia), or a joint occurrence of several symptoms, e.g. 13 indicates impaired consciousness together with speech disorder (aphasia). Health_Complications are divided into four different groups: 1 (deep vein thrombosis), 2 (other CV complications), 3 (pneumonia), 4 (other complications); the value 0 stands for unspecified complications and, similarly, the code 13 expresses the double complication of deep vein thrombosis and pneumonia.

The k-Means clustering can be done with various software, but here we use MATLAB R2020b. MATLAB® [15] combines a desktop environment tuned for iterative analysis and design processes with a programming language that expresses matrix and array mathematics directly. Using the predefined Matlab functions, we can run all popular classification, regression and clustering algorithms for supervised and unsupervised learning.

Matlab enables us to fine-tune all the parameters of clustering by writing program code, to find the optimal number of clusters, and to evaluate the clustering solutions by analysing silhouette values and the Davies–Bouldin and Calinski–Harabasz index values.

The analysis and clustering of the described data file using the k-Means algorithm is executed by the Matlab program code presented in Fig. 10.8.

Fig. 10.8 Matlab code for k-Means algorithm

The operator in program line 6 enables us to specify a testing set for the unsupervised learning of the k-Means algorithm and to select the attributes. As specified by operator 6, for this case we selected 440 cases, from record 21 to record 460. After the initial computation phase we noticed that the variables Days at hospital and Gender had a negative influence on the k-Means performance. So, for the following computation stage we excluded them from our research and tried to find the clusters by selecting only attributes 2, 3, 4, 5, 7, 8 and 9 (see Table 10.5).

In order to find the optimal number of clusters, we calculated the Davies–Bouldin and Calinski–Harabasz index values (program lines 8 and 9, Fig. 10.8). The output of line 9 is presented in Fig. 10.9.

Fig. 10.9 Calinski–Harabasz and Davies–Bouldin criterion values

The optimal number of 3 clusters was suggested by the Calinski–Harabasz criterion, whereas the Davies–Bouldin criterion advised an optimal number of 2 clusters. Therefore we explored both cases, 3 and 2 clusters, for evaluation.

The estimation of the k-Means model on our data was started with k = 3 (line 10); it calculated the best total sum of distances to the centroids and the average silhouette values. The calculation results are presented in Fig. 10.10.

Fig. 10.10 k-Means accuracy verification for 3 clusters

In Fig. 10.10 the average silhouette value equals 0.7695, which confirms the excellent partition of our cases into 3 clusters. The silhouette plot in Fig. 10.11 visually confirms this assertion; only a very small number of cases have silhouette values less than 0.6.

Fig. 10.11 The silhouette plot for 3 clusters

The k-Means model calculations for 2 clusters show worse performance compared to the case of 3 clusters (Fig. 10.12).

Fig. 10.12 k-Means accuracy verification for 2 clusters

The average silhouette value of 0.6846 for 2 clusters differs only by a small amount from that for 3 clusters; however, the partition of cases into 3 clusters may be judged more adequate by the final expert assessment. After applying the selected k-Means clustering model, the clusters can be further explored according to numerous characteristics of the variables included in the different clusters. We use the Matlab program code to calculate the number of cases in the clusters (line 19) and the sums of distances to the centroid centres (line 17) in order to characterise the size of the clusters and the similarity of the objects within them (Table 10.6).

Table 10.6 Cluster information

In order to check the membership of a particular patient (or group of patients) in some cluster, we may apply different functions of the machine learning environment, such as Matlab: for example, the command in line 18 (Fig. 10.8) displays the clusters for cases 41 to 50.
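The same cluster information (cluster sizes, within-cluster sums of distances and the membership of selected records) can be read off a fitted model in other environments as well; the following scikit-learn sketch on synthetic data mirrors lines 17–19 of the Matlab code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the selected patient attributes
X, _ = make_blobs(n_samples=440, centers=3, n_features=7, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Number of cases in each cluster (analogue of line 19 of the Matlab code)
sizes = np.bincount(km.labels_)

# Sum of distances from each member to its centroid, per cluster (line 17)
dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
sum_dists = np.array([dists[km.labels_ == c].sum() for c in range(3)])

print("cluster sizes:", sizes)
print("sums of distances to centroids:", sum_dists)

# Cluster membership of cases 41 to 50 (analogue of line 18)
print("clusters of cases 41-50:", km.labels_[40:50])
```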

Based on the demonstrated example, we can state that the application of Matlab to machine learning algorithms offers a high degree of configuration freedom and allows the researcher to control the parameters of the method and to test various computational scenarios. Understanding the underlying principles of machine learning models and the flexibility of their application in different computational environments enables domain experts and researchers to derive important analytical insights.

5 Artificial Neural Network

The Artificial Neural Network (ANN) model is inspired by the biological neural network. It can learn to perform tasks by observing examples, without applying any rules of a particular task.

The ANN model and its modifications are widely used in various application domains, such as language recognition, machine translation, social network filtering, facial recognition, financial instrument prediction, and many more where classification or time series forecasting tasks are relevant. In medicine, ANNs are used to diagnose various diseases and their complications, to evaluate the effects of drugs, to predict the duration of treatment, or to cluster medical anomalies. An ANN may link the symptoms of patients with a specific disease and learn to identify the disease accordingly.

In contrast to statistical methods, the ANN is a data-driven approach; the ANN model is therefore trained on the available data set before being applied under the testing conditions of the researched domain. For each task, it is necessary to set up an appropriate neural network. The following methodology should be followed:

1. Preparation of data for the study. This includes data collection, organisation, normalisation, and preparation of the training and testing samples.

2. Selection of the ANN structure. It is determined by the number of outputs, input variables, hidden layers and the number of neurons of the model. The neuron connection principles, threshold and transfer functions should be determined as well.

3. ANN training. The network training strategy and training algorithm need to be chosen, and the training effectiveness evaluated.

4. Network testing. The evaluation of the created neural network is performed using an input data set other than the one used for its training.

All these tasks are highly interrelated and influence the quality of the model. Depending on the available input data set and the task being solved, the appropriate network structure is modelled and the most suitable ANN training algorithm applied. The two most common neural network structures, the single-layer perceptron and the backpropagation network (multilayer perceptron), are further discussed and explored in the experimental example.

Single-layer perceptron

The single-layer perceptron is the simplest form of ANN, used to classify linearly separable structures. It is a single-layer feedforward neural network with a threshold transfer function (Fig. 10.13). Rosenblatt [16] proved that if such a network is trained on examples from linearly separable classes, then the perceptron algorithm converges and finds a hyperplane separating those classes.

Fig. 10.13 Single-layer perceptron

The solution of the perceptron equation \(\sum(x) = 0\) defines a line or hyperplane as the boundary between the distinct classes. The solution is obtained by training the network and choosing the correct network weights. As mentioned, the perceptron can only distinguish between linearly separable classes (Fig. 10.14). To describe the structure and training algorithm of the perceptron, we use the notation of Hajek [17].

Fig. 10.14 Linearly separable classes (a), not linearly separable classes (b)

Perceptron learning is a supervised learning scheme. Thus the training set consists of pairs \(\{(x^{(p)}, d^{(p)})\}_{p=1}^{N}\), where \(x^{(p)}\) denotes the input vector \(x^{(p)} = (x_1^{(p)}, x_2^{(p)}, \ldots, x_n^{(p)})^T\), and \(d^{(p)}\) is the known output vector (teacher) whose components can take only the two values 0 or 1. Let \(y^{(p)}\) be the output vector of the neural network.

The error function can be introduced as:

$$J = \sum_{p = 1}^N \left( y^{\left( p \right)} - d^{\left( p \right)} \right) w^{T} x^{\left( p \right)}$$

The neural network correctly separates classes when \(J=0.\) In all other cases, the separating plane is not found.

In the practical implementation of perceptron training, we change the weights according to the perceptron learning rule \(w \leftarrow w + \eta \left( d^{(p)} - y^{(p)} \right) x^{(p)}\), with learning rate \(\eta\), until \(J\) becomes as small as possible and no longer changes. If \(J = 0\), then the classes were linearly separable and we have separated them. If \(J \ne 0\), then the classes were not linearly separable and we have found the most appropriate separation of those classes.
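A minimal sketch of this training loop is given below in Python with numpy; it applies the perceptron rule to a small linearly separable example, with the learning rate and stopping condition chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two linearly separable classes in the plane, with a bias input appended
X = np.vstack([rng.normal([0, 0], 0.5, size=(20, 2)),
               rng.normal([3, 3], 0.5, size=(20, 2))])
d = np.array([0] * 20 + [1] * 20)              # teacher values
Xb = np.hstack([X, np.ones((40, 1))])          # bias term as an extra input

w = np.zeros(3)
eta = 0.1                                      # learning rate

for epoch in range(100):
    errors = 0
    for x_p, d_p in zip(Xb, d):
        y_p = 1 if w @ x_p >= 0 else 0         # threshold transfer function
        if y_p != d_p:
            w += eta * (d_p - y_p) * x_p       # perceptron weight update
            errors += 1
    if errors == 0:                            # J = 0: classes separated
        break

print("separating weights:", w, "after", epoch + 1, "epochs")
```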

Several perceptrons can be combined into a more complex network. Such a structure makes it possible to distinguish more complex classes of objects, such as those that can be separated by a plane or a hyperpolygon. Figure 10.15 shows a perceptron network with many input and output neurons.

Fig. 10.15 Perceptron network

As the perceptron neural network consists of individual perceptrons, each of them can be trained separately according to the algorithm described above. In the 1960s, when perceptron networks became very popular, many researchers thought that any intelligent system could be constructed with the help of perceptron networks. Unfortunately, it later turned out that far from all systems are so simple. When, in 1986, elementary McCulloch–Pitts neural networks were replaced by networks with differentiable activation functions and an advanced backpropagation algorithm was described, many complicated systems could be modelled using such neural networks [18].

Backpropagation networks

The backpropagation network is often referred to as the feedforward multilayer perceptron (Fig. 10.16). Its training is supervised, using a training set, and the learning algorithm, called the backpropagation algorithm, uses a gradient descent method to minimise the total squared error.

Fig. 10.16 Multilayer perceptron with one hidden layer

The backpropagation algorithm was first described in the work of Bryson and Yu-Chi Ho [19], but it did not receive wider recognition until 1986, when Rumelhart et al. [18] published their article. The later period was characterised by particularly strong development of artificial neural networks and their applications.

Using the gradient descent method, it is necessary to differentiate the transfer function with respect to the input variables and weights. Thus, the nonlinear sigmoid function or the hyperbolic tangent is most commonly used in backpropagation networks. The multilayer perceptron allows the classification of more than just linearly separable classes: depending on how many neurons are in the hidden layer, we can obtain a separation surface in the form of a convex polygon with approximately as many edges as there are neurons in the second layer.

Once the ANN topology is established, we need to adopt the training algorithm; a backpropagation algorithm is applied for training the multilayer perceptron. It consists of two phases: forward propagation and backward propagation.

As the ANN propagates forward, the input variables are transformed layer by layer into the output layer variables using fixed weights, thresholds and transfer functions. In the backpropagation phase, all network weights are recalculated depending on the size of the error signal, which is calculated as the difference between the values of the ANN output variables and the predetermined output vector (teacher).

The opportunity to learn from examples and gain experience has allowed neural networks to be widely used to solve practical problems. Artificial neural networks can help to examine the structure of data, determine its trend, make a forecast, assess risk, or predict impending anomalies. To do this, the neural network must be trained using historical data. The ANN is most commonly used to address classification or clustering challenges because of its greater accuracy and flexibility compared to traditional statistical methods. The most critical challenge for the application of ANNs in healthcare and other high-risk decision domains lies in their “black box” structure: as the model learns from a data set with labelled output (a teacher), it learns how to estimate the output from the input variables, but it does not provide rules or formulas clarifying the dependencies for decision making. Numerous modifications of neural network algorithms are proposed in research works in different conceptual development areas for creating transparency of the ANN performance.

The experimental research

The selected classification task concerns the rehabilitation assignment for patients who have experienced a stroke. The experimental analysis was performed on the input data set presented in Table 10.5. As the neural network is a supervised learning algorithm, we needed an output variable for training and testing the best performing NN model. Therefore, one more variable of the historical stroke patient database was included, which denotes the rehabilitation type prescribed by the expert doctors during hospitalisation. There were four types of rehabilitation therapy (Table 10.7), and the cases with no assignment to rehabilitation were excluded.

Table 10.7 Rehabilitation types

Successful solving of this task leads to a neural network model which could forecast the output: propose the relevant rehabilitation type according to the health characteristics of the patient. The model could also serve to better plan human resources, as different types of rehabilitation require the involvement of different specialists and the scheduling of their time.

The analysis lets us determine what kind of rehabilitation is most likely to be prescribed according to the nine input variables (Table 10.5) serving as health characteristics of the patient.

Several experiments were performed to explore the performance of the ANN models. In the first step, neural network models were generated from the data set by applying different algorithms, and the three best performing models were retained for further analysis. The second step explored the accuracy of the models in solving the classification task by analysing their general performance and the confusion among the output classes. The third step revealed the importance of the different variables for building the neural network model. The last step investigated the classification behaviour at different value ranges of the variables. The last two steps address the “black box” nature of the ANN models. In healthcare problems the “black box” situation is mostly not acceptable, as it means that the ANN model may just advise the output without providing rules or explanatory insights; therefore many modifications of the ANN algorithms are being explored for converting the ANN into a “grey box” or “white box”.

The STATISTICA for Windows software was used to design the neural networks. The data set was randomly split into three subsets used for training (70%), evaluation and model selection (15%), and testing (15%).

Three models were retained (Fig. 10.17); we can see that different algorithms, such as the single-layer perceptron NN, MLP (multilayer perceptron) and RBF (radial basis function) network, had similar performance. The MLP model was the most accurate in the training stage (0.95), whereas the RBF was slightly better in the testing stage, which may indicate good performance on unknown new data. The general classification precision of the models in the different stages varied between 0.79 and 0.95 (Fig. 10.17), which indicates a good possibility of proposing the most suitable rehabilitation type. The structure of the retained neural network models is described by their profile data (Fig. 10.17), which denotes the number of input variables (9), the number of neurons in each hidden layer, and one output variable with four classification outcomes (Table 10.7).

Fig. 10.17 Multilayer perceptron with one hidden layer
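Although the chapter's models were built with the STATISTICA Data Mining tools, the same type of multilayer perceptron experiment can be sketched in Python with scikit-learn. The fragment below only illustrates the 70/15/15 split and the accuracy and confusion-matrix analysis on synthetic data with 9 inputs and 4 output classes; it is not the model reported in Fig. 10.17.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in: 9 input attributes, 4 rehabilitation classes
X, y = make_classification(n_samples=600, n_features=9, n_informative=6,
                           n_classes=4, random_state=0)

# 70% training, 15% evaluation (model selection), 15% testing
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.30,
                                                    random_state=0)
X_eval, X_test, y_eval, y_test = train_test_split(X_rest, y_rest,
                                                  test_size=0.50,
                                                  random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(12,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

print("training accuracy  :", accuracy_score(y_train, mlp.predict(X_train)))
print("evaluation accuracy:", accuracy_score(y_eval, mlp.predict(X_eval)))
print("test accuracy      :", accuracy_score(y_test, mlp.predict(X_test)))
print("test confusion matrix:\n", confusion_matrix(y_test, mlp.predict(X_test)))
```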

As the model aims to correctly select the output value, namely the rehabilitation type, we may explore the performance of the different models in assigning particular output values. The confusion matrix in Fig. 10.18 reveals that the single-layer NN model had quite significant confusion among the classes: it never assigned any case to the rehabilitation types RWt and RSp, while most of the RSw cases were wrongly assigned to RPt. Similar confusion problems were demonstrated by the RBF model. Despite the similar general accuracy of the models, the best ability to recognise the different output classes was shown by the MLP model.

Fig. 10.18 Confusion matrix

The confusion problem may be caused by the different numbers of cases with the various outputs used for training the models (in our case, most cases had the output value RPt), or it may be determined by the significance of the different variables, which can be explored by sensitivity analysis of the designed neural network models. In Fig. 10.19 the variables are ranked by the ratio of their significance.

Fig. 10.19 Sensitivity analysis of the most influential variables

The sensitivity analysis in Fig. 10.19 revealed the different importance of the variables for generating the ANN models. The most influential variables are shown: for the single-layer NN, Health complications was ranked first, whereas for the MLP and RBF models Age and Stroke symptoms were ranked first and second, respectively. The sensitivity analysis may suggest areas for more detailed investigation and for improving the precision of the models. This can be achieved by enriching the data in the areas related to the most significant influences and by identifying the most vulnerable areas of inaccurate performance.
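STATISTICA reports this ranking directly; for a fitted classifier in another environment, a comparable ranking can be approximated with permutation importance, as in the Python sketch below. The attribute names and the synthetic data are hypothetical; what matters is the technique of shuffling one input at a time and measuring the drop in accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in with 9 named patient attributes (hypothetical names)
names = ["Age", "Gender", "Stroke_Type", "Treatment", "Health_Status",
         "Days_Hosp", "Past_Stroke", "Symptoms", "Complications"]
X, y = make_classification(n_samples=600, n_features=9, n_informative=6,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

mlp = MLPClassifier(hidden_layer_sizes=(12,), max_iter=2000,
                    random_state=0).fit(X_train, y_train)

# Permutation importance: drop in accuracy when one input is shuffled
result = permutation_importance(mlp, X_test, y_test, n_repeats=10,
                                random_state=0)
ranking = sorted(zip(names, result.importances_mean),
                 key=lambda t: t[1], reverse=True)
for name, importance in ranking:
    print(f"{name:>15}: {importance:.4f}")
```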

The performance of the MLP model in recognising the values of the output variable shows the strongest reliability for rehabilitation type RPt (94.6% correct), while the model is not useful for RWt (30.77% correct) and RSp (29.17% correct). It can be noticed that a relatively small number of cases with a particular output is not the determining factor, as RSw (43 cases) reaches 74.42% accuracy, whereas RWt has a similar number of cases (39) with much lower performance (30.77%) (Fig. 10.20).

Fig. 10.20 Sensitivity analysis of the most influential variables

The application of ANN algorithms and models in healthcare has broad potential due to their computational power, since regression, classification and time series tasks are important in healthcare processes related to diagnosis, treatment, rehabilitation and many others. However, the experimental research has demonstrated the necessity of applying various approaches not only for building models and analysing their general accuracy, but also for in-depth analysis of their performance, influences and possible sources of vulnerabilities and inaccuracies.

6 Conclusion

The chapter provides the essential characteristics of methods traditionally applied for data processing, such as regression analysis, as well as their extensions towards artificial intelligence methods, such as logit and probit models, k-Means and neural networks. The healthcare domain uses a variety of data sources and measurement scales, as well as different target requirements for the output information. This implies that different methods have to be considered for solving the tasks, while in-depth analysis of the generated solution models may lead to the adoption or rejection of particular models due to their imbalanced reliability across different classes and segments of cases. The performance of the methods, their analytical power and their relevance to the healthcare application domain are illustrated by brief experimental computations on a stroke patient database. Various software tools, such as STATISTICA, Matlab and Google BigQuery ML, were applied for the analysis, ensuring a broad variety of analytical instruments for in-depth analysis of the generated solutions and for deriving new insights for their improvement. The regression analysis characteristics and the experimental examples of their application reveal advantages, disadvantages and causes of irrelevant application of the methods. The analytical tools not only enhance the transparency of data-driven artificial intelligence models, but may also indicate areas for improving data quality, or initiate potential sources for supplementing enriched data related to the most influential variables characterising persons and various aspects of healthcare.