1 Introduction

Cardiovascular diseases (CVDs) are the most widespread chronic diseases worldwide and have been the leading cause of morbidity and death over the last decade [1]. According to the World Health Organization (WHO), 17.9 million people die from CVDs every year, representing 32% of all deaths worldwide. The number of CVD cases is rising rapidly, and by 2030 the yearly death toll is expected to reach approximately 22.2 million people [2, 3]. A report by the Centers for Disease Control and Prevention confirms this expected increase in mortality, stating that one person dies from CVDs every 40 s [4]. In Egypt as well, CVDs have been the leading cause of death over the last 30 years, accounting for an estimated 46.2% of all deaths in 2017 [5].

CVDs are an umbrella term for a group of disorders of the heart and blood vessels, including (1) congestive heart failure, (2) coronary heart disease, (3) congenital heart disease, (4) cerebrovascular disease, and (5) rheumatic heart disease [6]. Four out of every five CVD deaths result from strokes and heart attacks. Heart disease can therefore be considered the most life-threatening chronic disease, and much of its risk stems from its silent nature: it is often not diagnosed until the symptoms of heart failure (or a heart attack) appear [7]. In heart disease, the heart fails to perform its normal function of supplying blood to the rest of the body because of blockage of the coronary arteries, which are responsible for supplying blood to the heart itself [8]. The typical symptoms include (1) shortness of breath, (2) body weakness, (3) confusion, and (4) fainting. The risk increases for people with aggravating factors including (1) an unhealthy diet, (2) smoking, (3) fitness issues, (4) high blood pressure, (5) lack of exercise, and (6) a high cholesterol level [9].

The early and accurate prediction of heart disease is crucial to enhance the survival rate and reduce mortality. It supports healthcare professionals in their decisions by providing an accurate and efficient diagnosis and treatment for patients [10]. One approach to the early and accurate prediction of heart disease is machine intelligence, which can be achieved using machine learning (ML) algorithms and deep learning (DL) approaches [11]. Various types of heart disease data, such as images, waves, and sounds, can be used for this purpose [12].

Image data can be analyzed and its features extracted to train an ML (or DL) approach, such as a CNN, to determine whether the images belong to a diseased or a healthy patient [13]. Heart disease can also be detected by gathering features from cardiac sounds and feeding them to a DL or ML algorithm; alternatively, the cardiac sounds can be converted into numerical data that serve as the input of a DL approach to check whether the patient has heart disease [14]. Another type of data deployed for heart disease detection is Electrocardiogram (ECG) and Electroencephalogram (EEG) waves, which can be analyzed and labeled as the input of a Recurrent Neural Network (RNN) model, or from which features can be extracted and converted into numerical data used as the input of an ML algorithm [15]. ML algorithms such as support vector machines and decision trees play an essential role in accurately predicting the existence of heart disease by analyzing medical data, whether voice or images, numerically [16,17,18]. DL approaches such as convolutional neural networks (CNN) can also analyze such data efficiently and can deal with large datasets [19,20,21].

1.1 Paper contributions

The current study focuses mainly on developing a hybrid system of ML algorithms and CNN models to predict and detect the existence of heart disease accurately based on the analysis of medical voice records and images. The suggested approach can aid healthcare professionals in improving the medical care provided to patients. The contributions of the current study can be summarized in the following points:

  • Proposing a hybrid system of ML algorithms and a DL approach for predicting heart disease.

  • Analyzing different types of datasets including medical images and voice records.

  • Suggesting a hybrid DL and Aquila Optimizer (AO) approach for the learning and optimization processes.

  • Reporting state-of-the-art performance metrics compared with other related studies and approaches.

1.2 Paper organization

The rest of this paper is organized as follows: In the next section, the related studies that contribute to heart disease diagnosis and prediction are described. Section 3 depicts the basic concepts regarding voice feature techniques, ML algorithms, Convolutional Neural Network (CNN), Metaheuristic Optimization using the Aquila Optimizer (AO), Image Augmentation, and Data Normalization. In Sect. 4, the approach suggested in this work for the heart disease learning and optimization phase is discussed. Section 5 illustrates the experiments and the reported results of the different approaches. Finally, Sect. 6 presents the main conclusion and future work.

2 Related work

In this section, the existing studies and research papers related to heart disease diagnosis and prediction based on various types of medical data are introduced. The related studies are split into studies that focused on (1) deep learning approaches, (2) machine learning algorithms, and (3) hybrid approaches.

2.1 Deep learning-based studies

Brunese et al. [22] proposed a methodology for detecting heart disease from cardiac sounds using DL. They used deep neural networks (DNN) to extract a set of features, analyzed the cardiac sounds, and determined whether they belonged to healthy patients or to patients with heart disease. In their experiments, 176 heartbeats were considered, of which 145 were related to heart disease patients and only 31 to healthy patients. The overall accuracy was 98%. Miao et al. [23] developed a DNN for predicting and diagnosing coronary heart disease using a Multi-Layer Perceptron (MLP), regularization, and dropout. They utilized 303 instances with attributes from patients of the Cleveland Clinic Foundation and achieved 83.67% accuracy and 93.51% sensitivity.

Abdel-Alim et al. [24] proposed a heart disease diagnosis system that uses an ANN to classify several heart disorders from heart sounds. The used dataset contained 850 cases, partitioned into 650 cases for ANN training and 200 cases for testing. They utilized different techniques to perform the diagnosis process, such as (1) the Fast Fourier Transform, (2) the Discrete Wavelet Transform, and (3) Linear Prediction Coding, and achieved a recognition rate of 95%. Ali et al. [19] suggested a smart monitoring system for predicting heart disease through DL approaches, feature selection, feature fusion, and weighting techniques on the Cleveland and Hungarian datasets. The proposed approach achieved an accuracy of 98.5%. Zhang et al. [25] carried out ECG classification using CNNs to identify heart disease. They used a dataset consisting of 102,548 heartbeats and achieved 97.7%, 97.6%, and 97.6% for positive predictive rate, sensitivity, and F1-score, respectively.

Zhang et al. [26] suggested an approach for diagnosing heart disease through signal processing and DL models that predict the disease from ECG signals. The used dataset contained 8524 single-lead episodic ECG records, and they reported an F1-score of 0.87. Kwon et al. [27] developed a DL approach for predicting mortality among heart disease patients from their ECGs. Among 25,776 cases, 1026 patients died, and the model achieved more accurate results than existing ML models. Sajeev et al. [28] proposed a DL-based heart disease prediction system that could determine the disease risk probabilities of patients, achieving an accuracy of 94% and an Area Under the Curve (AUC) score of 0.964.

Rath et al. [29] carried out heart disease detection on ECG samples through a DL model based on LSTM and a Generative Adversarial Network (GAN) to achieve the best efficiency. The results reported a best accuracy of 99.2%, F1-score of 0.987, and AUC score of 0.984. Darmawahyuni et al. [30] developed a framework for detecting coronary heart disease based on a DNN and the UCI repository heart disease dataset. They achieved a specificity of 92%, sensitivity of 99%, and accuracy of 96%.

2.2 Machine learning-based studies

Jindal et al. [31] proposed a heart disease prediction system using ML algorithms to predict whether a patient has heart disease. They relied on the medical history of each patient in a dataset containing 13 medical attributes for 304 patients, collected from the UCI repository. They used ML algorithms such as (1) K-Nearest Neighbor (KNN), (2) Logistic Regression (LR), and (3) Random Forest Classifier (RFC). Among these algorithms, KNN achieved the best accuracy with a value of 88.52%. They also built a combined model from the used ML algorithms, which achieved 87.5% accuracy and outperformed their related studies.

Muhammad et al. [32] developed an intelligent computational model for the early and accurate detection and diagnosis of heart disease based on ML algorithms. They utilized many ML algorithms, such as (1) RFC, (2) Artificial Neural Network (ANN), (3) Support Vector Machine (SVM), (4) LR, (5) KNN, (6) Naïve Bayes (NB), (7) Extra-Tree Classifier (ETC), (8) Gradient Boosting (GB), (9) AdaBoost (AB), and (10) Decision Tree (DT), on the Cleveland and Hungarian heart disease datasets available in the UCI repository. They compared the algorithms using performance evaluation metrics; the best were ETC and GB, with overall accuracies of 94.41% and 93.36%, respectively.

Pugazhenthi et al. [33] developed a framework for detecting ischemic heart disease from medical images using ML algorithms such as (1) MLP, (2) SVM, and (3) the C5 classifier. The reported results showed that the highest accuracy, 92.1%, was obtained by SVM. Alarsan et al. [34] developed an approach for heart disease detection based on ECG classification, using ML algorithms on features extracted for the classification process. They used a dataset containing 205,146 records for 51 patients and deployed ML algorithms such as (1) RFC, (2) DT, and (3) Gradient-Boosted Trees (GDB). The highest accuracy, 97.98%, was obtained by the GDB algorithm. Nikhar et al. [35] proposed a methodology for predicting heart disease using ML algorithms on the Cleveland heart disease database, which contains 303 records with 76 medical attributes. They performed the experiments using NB and DT, which achieved the highest accuracies.

Patel et al. [36] built a heart disease prediction system by utilizing ML algorithms and data mining techniques on the Cleveland database of the UCI repository, which has 303 instances. The used algorithms, RFC and Logistic Model Tree (LMT), performed the prediction process effectively. Singh et al. [37] proposed a heart disease prediction system using ML approaches. They compared various ML algorithms, such as KNN, SVM, DT, and LR, on a dataset collected from the UCI repository. The results showed that the highest accuracy was achieved by KNN with 87%, followed by SVM with 83%, DT with 79%, and LR with 78%. Krishnan et al. [38] proposed a system for predicting the probability of heart disease based on ML approaches such as DT and NB. They used data from the UCI repository containing 300 instances with 14 clinical parameters. The DT algorithm had the highest accuracy with 91%.

2.3 Hybrid-based studies

Pasha et al. [39] proposed a framework for predicting cardiovascular disease using DL techniques and different algorithms such as (1) SVM, (2) DT, (3) KNN, and (4) ANN. They collected a dataset containing attributes related to heart disease from Kaggle and compared the algorithms to identify the best one, which was ANN with an overall accuracy of 85.24%. Raza et al. [40] developed a framework for classifying heartbeat sound signals using DL approaches. They utilized a recurrent neural network (RNN) built from long short-term memory (LSTM), dense, dropout, and SoftMax layers, and also deployed MLP, DT, and RFC models. The results showed that the RNN was the most efficient of them, with an accuracy of 80.80%. Arabasadi et al. [10] proposed a Computer-Aided System (CAS) for heart disease detection based on a hybrid model using Neural Networks (NN) and Genetic Algorithms (GA). They used a dataset containing information on 303 patients and achieved 93.85% accuracy, 97% sensitivity, and 92% specificity.

Sajja et al. [41] proposed a DL approach for the early prediction of cardiovascular diseases based on CNNs. They used a dataset from the UCI repository and compared traditional algorithms, namely (1) LR, (2) KNN, (3) SVM, (4) NB, and (5) NN, with the proposed approach, which reported the best accuracy of 94.78%. Haq et al. [42] proposed a hybrid intelligent system for the prediction of heart disease based on ML algorithms to distinguish healthy people from heart disease patients by analyzing the Cleveland heart disease dataset. They utilized 3 feature selection algorithms, 7 classifiers, performance evaluation metrics, and cross-validation. The results showed that the best algorithms were LR and SVM with accuracies of 89% and 88%, respectively. Gavhane et al. [43] suggested a symptom-based heart disease prediction framework using ML algorithms such as NN and MLP, and found NN to be the most accurate algorithm for the prediction process. Sharma et al. [44] suggested a framework for heart disease prediction using a DNN on the UCI heart disease repository. They utilized different algorithms, namely (1) KNN, (2) SVM, (3) NB, and (4) RFC, for the classification process, and used Talos optimization with the DNN, which achieved the best accuracy of 90.76%.

2.4 Related studies summarization

Table 1 summarizes the discussed related studies published in 2021 and 2020, while Table 2 summarizes those published in 2019 or earlier. The studies are ordered in descending order by publication year.

Table 1 Related Studies in (2021 and 2020) Summarization
Table 2 Related Studies in (2019 or before) Summarization

2.5 Plan of solution

The current study proposes a hybrid approach for heart disease learning and optimization through various phases (as shown in Fig. 1). The first phase handles dataset collection for the Classifying Heart Sounds Challenge dataset and the medical images. The second phase pre-processes these data, including data augmentation and scale conversion techniques. The third phase is the optimization phase, which involves the structure, the learning process, and the data augmentation approaches, utilizing the pre-trained CNN models and ML algorithms. The fourth phase covers the numerical and graphical feature extraction techniques. The fifth phase represents the classification process, which is based on DL approaches via transfer learning and on ML algorithms. Finally, the sixth phase involves measuring the performance of the ML and DL approaches through various experiments and calculated performance metrics.

Fig. 1
figure 1

The Suggested Framework Parts Summarization

3 Background

This section provides the main background that can help the reader become familiar with the parts of the suggested hybrid approach. It is divided into the following points:

  • Voice feature extraction techniques.

  • Machine learning algorithms.

  • Convolutional neural network.

  • Metaheuristic optimization using the Aquila Optimizer (AO).

  • Image augmentation.

  • Data normalization.

3.1 Voice feature extraction techniques

Feature extraction is one of the important steps in the learning process of the algorithms: it minimizes calculations, selects the most informative features of the dataset, and provides the information required to train the model [45]. Features can be extracted from different types of data such as audio, images, and waves [46]. In the current study, features are extracted numerically and graphically from the audio records. There are many audio feature extraction techniques, but the ones used in the current study are (1) Mel-Frequency Cepstral Coefficients (MFCC), (2) Mel-Spectrogram, (3) Zero Crossing Rate (ZCR), (4) Root Mean Square Energy (RMSE), (5) spectral-based, (6) Tonnetz, and (7) chroma-based techniques [47].

3.1.1 Mel-frequency cepstral coefficients (MFCC)

Mel-Frequency Cepstral Coefficients (MFCC) is the most common technique used for extracting audio features, both numerical and graphical [48]. In MFCC, the signal is framed and a Hamming window is used to reshape the signal into very small windows [49]. Figure 2 shows the steps of extracting the MFCC features [50]. MFCC uses the Discrete Cosine Transform (DCT) internally; if the DCT type is 3, it is referred to here as the HTK-style MFCC, while if the DCT type is 2, it is referred to as the Slaney-style MFCC [51].

Fig. 2
figure 2

The Steps of Extracting the MFCC Features from an Audio Record
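As a minimal illustration (an assumption on our part, not the authors' exact pipeline), the two MFCC variants mentioned above can be obtained with librosa, which exposes the DCT type as a parameter; the file name, sampling rate, and number of coefficients are illustrative only.

```python
import librosa

# Load one heart-sound record (path and sampling rate are illustrative)
y, sr = librosa.load("heartbeat.wav", sr=22050)

# DCT type 2: referred to in the text as the Slaney-style MFCC
mfcc_slaney = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, dct_type=2)

# DCT type 3: referred to in the text as the HTK-style MFCC
mfcc_htk = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, dct_type=3)

print(mfcc_slaney.shape, mfcc_htk.shape)  # both are (n_mfcc, n_frames)
```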

3.1.2 Mel-spectrogram (MS)

Mel-Spectrogram (MS) is one of the most efficient techniques for audio processing, extracting features from audio, and transforming them into feature images [52]. Figure 3 shows the steps of extracting the MS features [53].

Fig. 3
figure 3

The Steps of Extracting the MS Features from an Audio Record

3.1.3 Zero-crossing rate (ZCR)

Zero-Crossing Rate (ZCR) is a feature extraction technique that measures how often the signal changes from positive through zero to negative or vice versa, and is used for recognizing voiced and unvoiced signals. ZCR is based on counting the number of times the waveform crosses zero within a specific time frame [54]. Equation 1 shows how to calculate the ZCR value.

$$\begin{aligned} \text {ZCR}=\frac{1}{\left( 2\times M\right) }\times \sum _{k=1}^{M}{|\hbox {sign}\left( a[k]\right) -\hbox {sign}\left( a[k-1]\right) |} \end{aligned}$$
(1)

where k is the sample index, M is the number of samples in the frame, and sign(a[k]) is calculated using Eq. 2.

$$\begin{aligned} \text {sign}\left( a\left[ k\right] \right) = {\left\{ \begin{array}{ll} 1, &{} \text {if } a\left[ k\right] \ge 0\\ -1, &{} \text {if } a\left[ k\right] <0 \end{array}\right. } \end{aligned}$$
(2)
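A direct NumPy sketch of Eqs. 1 and 2 is given below; it assumes that M counts the consecutive sample pairs of the frame, which is one possible reading of the definition above.

```python
import numpy as np

def zero_crossing_rate(a):
    """ZCR of a 1-D frame a, following Eqs. (1) and (2)."""
    a = np.asarray(a, dtype=float)
    s = np.where(a >= 0, 1, -1)    # sign(a[k]) as defined in Eq. (2)
    M = len(a) - 1                 # number of consecutive sample pairs (assumed)
    return np.abs(np.diff(s)).sum() / (2 * M)

print(zero_crossing_rate([0.3, -0.2, 0.1, 0.4, -0.5]))  # 3 crossings over 4 pairs -> 0.75
```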

3.1.4 Chroma-based techniques

There are many chroma-based techniques, but the ones used in the current study are (1) chroma-only, (2) Short-Time Fourier Transform (STFT), (3) Constant-Q Transform chromagram (CQT), and (4) Chroma Energy Normalized Statistics (CENS). The Short-Time Fourier Transform (STFT) is a sequence of Fourier transforms of an audio signal that enables time-frequency analysis in situations where the frequency content of the signal changes over time. It is a fixed-resolution method that segments the signal into time intervals and takes the Fourier Transform of every segment [55].

The Constant-Q Transform (CQT) is a wavelet-like transform that maps a time-domain signal to the time-frequency domain. The center frequencies of the frequency bins are spaced so that their Q-factors are equal; consequently, the frequency resolution is better for low frequencies, whereas the time resolution is better for high frequencies. The CQT gives better results when logarithmic frequency mapping and low frequencies are of concern [56]. Chroma Energy Normalized Statistics (CENS) is a group of scalable sound features utilized for sound matching. It computes the short-time energy spread of the signal. CENS is used for extracting chroma features that capture the melodic and harmonic characteristics of sounds and represent a short time window of the sound [57].

3.1.5 Root mean square energy (RMSE)

RMSE is the square root of the mean of the squared signal amplitudes over a short-time window of the sound wave. RMSE plays an essential role as a loudness indicator: the higher the energy, the louder the sound. RMSE has been utilized in sound segmentation and genre classification [58]. Computing the RMSE value directly from the voice records is faster as it does not require any STFT calculations; however, using a spectrogram can give a more accurate representation of the energy over time since its frames can be windowed. Equation 3 shows how to calculate the RMSE value.

$$\begin{aligned} \text {RMSE}=\sqrt{\frac{1}{N}\times \sum _{k=1}^{N}{x_k^2}} \end{aligned}$$
(3)

where N is the number of samples and \(x_k\) is the k-th sample of the signal.

3.1.6 Tonnetz

Tonnetz is used to analyze the tonal centroid features of audio signals and to learn the features extracted from the audio files [59].

3.1.7 Spectral-based techniques

There are many spectral-based techniques, but the ones used in the current study are (1) spectral centroid, (2) spectral bandwidth, (3) spectral contrast, (4) spectral flatness, and (5) roll-off frequency. The spectral centroid is a measure of spectral shape and position: it characterizes the shape of the spectrum of the waveform, indicates the brightness of the sound, and identifies the frequency band where most of the energy is concentrated. Hence, a high spectral centroid value means that more of the signal energy is concentrated in the higher frequencies [60].

The spectral bandwidth is the spectral range of interest around the centroid; it is derived from the spectral centroid as the weighted mean of the distances of the frequency bands from it. The bandwidth is directly proportional to the spread of energy across the frequency bands [61]. The spectral contrast is the difference between the valleys and peaks in the spectrum. It carries additional spectral information and represents the relative spectral characteristics: it normalizes the sound, keeping the main peaks of a sound signal constant while attenuating the valleys in the spectrum [62].

Spectral flatness estimates the characterization of an audio spectrum, i.e., the uniformity of the signal energy distribution and the noisiness of the energy spectrum in the frequency domain. A high spectral flatness value indicates that the spectrum has similar energy in all spectral bands, while a low value means that the spectral energy has low uniformity in the frequency domain [63]. The roll-off frequency is the frequency below which 95% of the energy of each signal lies. It is used to differentiate between unvoiced and voiced speech; unvoiced speech has a high level of energy in the high-frequency part of the spectrum [64].
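To show how the above numerical features could be gathered per record, the sketch below uses librosa's built-in extractors and averages every frame-wise feature into a fixed-length vector; the mean-pooling step and the parameter values are assumptions, not the authors' exact settings.

```python
import numpy as np
import librosa

def numerical_features(path):
    """Sketch: one fixed-length feature vector per (sub-)record."""
    y, sr = librosa.load(path, sr=22050)
    feats = [
        librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20),
        librosa.feature.melspectrogram(y=y, sr=sr),
        librosa.feature.zero_crossing_rate(y),
        librosa.feature.rms(y=y),
        librosa.feature.spectral_centroid(y=y, sr=sr),
        librosa.feature.spectral_bandwidth(y=y, sr=sr),
        librosa.feature.spectral_contrast(y=y, sr=sr),
        librosa.feature.spectral_flatness(y=y),
        librosa.feature.spectral_rolloff(y=y, sr=sr, roll_percent=0.95),
        librosa.feature.chroma_stft(y=y, sr=sr),
        librosa.feature.chroma_cqt(y=y, sr=sr),
        librosa.feature.chroma_cens(y=y, sr=sr),
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr),
    ]
    # Collapse the frame axis so every record yields the same number of features
    return np.hstack([f.mean(axis=1) for f in feats])
```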

3.2 Machine learning algorithms

ML algorithms are programs with a specific way of adjusting their parameters (i.e., weights) based on feedback from their previous experience in predicting the related dataset [65]. In this work, five ML algorithms are deployed to detect the existence of heart disease: (1) KNN, (2) DT, (3) AB, (4) RFC, and (5) the Extra Trees Classifier (ETC).

3.2.1 K-nearest neighbour (KNN)

KNN is one of the most widely used ML algorithms for versatile problems such as regression and classification, but it is most commonly utilized for classification [66]. It is one of the simplest and easiest algorithms to implement; however, it is computationally expensive [67]. KNN works by storing the available cases and classifying new ones according to the majority vote of their k nearest neighbors [68]. A distance function is used to find the distances between a query and all cases in the data; the algorithm then chooses the closest cases to the query and takes the most frequent label among them [69]. The “k” in the KNN algorithm represents the number of nearest neighbors used for dealing with new cases. If the “k” value is high, the algorithm overlooks classes with few samples; if the “k” value is low, it can be sensitive to outliers [70].

3.2.2 Decision trees (DT)

It is a decision support technique with a tree-like structure consisting of three parts: (1) leaf nodes, (2) root nodes, and (3) decision nodes [71]. The algorithm splits the training dataset into branches that further split into other branches. The nodes in the DT represent the attributes used for predicting the outcome, and the decision nodes provide the links to the leaves (as shown in Fig. 4).

Fig. 4
figure 4

A sample of the decision trees (DT) with its components

The decision nodes and root nodes represent the features in the dataset [72]. Hence, the DT algorithm provides various outputs, and the highest one is selected as the final output. From the DT algorithm, a model can be built that predicts the value of the target variable from decision rules inferred from the training data [73]. The tree representation of the algorithm helps in understanding the problem and reaching the optimal solution, so it is one of the easiest and simplest models to implement [71].

3.2.3 Random forest classifier (RFC)

It is an ML algorithm representing a collection of DTs; it combines the outputs of multiple DTs to obtain a single, more accurate result. When a new case described by the DT attributes needs to be classified, each tree gives a classification, i.e., votes for a class, and the forest selects the class with the most votes among the trees [74]. The RFC algorithm is very flexible, easy to implement and understand, and can achieve a stable prediction output [75]. Its training process is based on the bagging method, combining learning models to improve the overall result [75]. Figure 5 shows a sample of the RFC and its inner components.

Fig. 5
figure 5

A Sample of the Random Forest Classifier (RFC) with its Components

3.2.4 Extra trees classifier (ETC)

It is an ML ensemble algorithm that combines the predictions of many decision trees by averaging them for regression tasks or by majority vote for classification problems. ETC is related to Random Forests and bagging. Additional trees are added until the model performance stabilizes, and the predictions of the various trees are aggregated to obtain the best one [76]. ETC is one of the fastest and most accurate ML algorithms and is based on randomization and optimization [77].

3.2.5 AdaBoost (AB)

It is a boosting approach utilized as an ensemble algorithm in ML and as a supervisory layer for other algorithms. It works by growing learners sequentially: a model is built from the training data, and another model is then built that attempts to correct the errors of the first. Models are added until the training set is predicted efficiently or the maximum number of models is reached [78]. AdaBoost turns a set of weak classifiers into a strong classifier, and predictions are performed based on the weighted average of the weak classifiers. AdaBoost relies on the performance of each stump, updating the sample weights and changing the training set depending on the results of the previous ones [79].
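The five classifiers described in this subsection are all available in scikit-learn; the self-contained sketch below, with synthetic data and illustrative hyperparameter values, shows how they could be instantiated and compared. It is only a sketch of the general setup, not the tuned configuration of this study.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)

# Synthetic stand-in for the extracted numerical voice features
X, y = make_classification(n_samples=500, n_features=30, n_classes=5,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15,
                                                    shuffle=True, random_state=0)

classifiers = {
    "KNN": KNeighborsClassifier(weights="distance", algorithm="ball_tree"),
    "DT": DecisionTreeClassifier(criterion="entropy", splitter="best"),
    "RFC": RandomForestClassifier(n_estimators=50, criterion="entropy"),
    "ETC": ExtraTreesClassifier(n_estimators=100, criterion="entropy"),
    "AB": AdaBoostClassifier(n_estimators=50),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, round(clf.score(X_test, y_test), 3))
```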

3.3 Convolutional neural network (CNN)

The Convolutional Neural Network (CNN) is one of the most powerful deep learning tools: it takes an input image, extracts features from it using filters (or kernels), and transfers it to lower dimensions without losing essential information. CNNs have demonstrated their ability to classify images effectively, as they can learn the intrinsic and latent image features, and are therefore the most popular choice for this task [80]. A CNN model has multiple layers, starting with the input layer, followed by convolutional, pooling, fully connected, batch normalization, and activation layers, and ending with the output layer [81]. A CNN’s architecture is composed of multiple layers as follows:

The input layer contains the input image and holds its pixel values. The convolutional layer is applied to the input image and extracts different levels of features using kernels (filters) with specific widths and heights; it determines the output neurons that are connected to a specific region of the input data. Multiple convolution operations are applied by sliding the filters over the input to extract feature maps at various levels and stack them to form the convolutional layer output [82]. The pooling layer down-samples the input and reduces the number of parameters, aiming to decrease the training time and reduce overfitting without losing important information; it can also affect the performance of the training process [83, 84].

The fully connected (FC) layer is a flattened feed-forward layer that supports the classification process after pooling. After the down-sampling and feature extraction processes, nonlinear combinations of features are learned from the output of the convolutional layers. All neurons in the fully connected layer are connected to the neurons in the previous and next layers, and a nonlinear activation function can be used to make predictions and classify the input data into different classes [85, 86]. The batch normalization layer is one of the main layers in the CNN architecture; it makes the model perform better and the training process faster by (1) allowing an extensive range of learning rates and (2) re-parametrizing the optimization problem, which makes the process more stable and smoother and helps avoid convergence to a local minimum [87]. The activation layer produces the output of each node using one of several activation (transfer) functions, including the Sigmoid, Hyperbolic Tangent (Tanh), Rectified Linear Unit (ReLU), Leaky ReLU, Exponential Linear Unit, Scaled Exponential Linear Unit, and SoftMax functions [88, 89].
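The layer types described above can be wired together in Keras as in the toy sketch below; the filter counts, input shape, and the assumption of five output classes are illustrative only and do not reflect the architectures tuned later in this study.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(100, 100, 3)),          # input layer (pixel values)
    layers.Conv2D(32, (3, 3), padding="same"),  # convolutional layer
    layers.BatchNormalization(),                # batch normalization layer
    layers.Activation("relu"),                  # activation layer
    layers.MaxPooling2D((2, 2)),                # pooling layer (down-sampling)
    layers.Flatten(),
    layers.Dense(64, activation="relu"),        # fully connected layer
    layers.Dropout(0.5),                        # dropout regularization
    layers.Dense(5, activation="softmax"),      # output layer (5 classes assumed)
])
model.summary()
```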

3.3.1 Transfer learning

Transfer learning is an ML concept in which a pre-trained model built for a specific task is reused for another task; it can be summarized as knowledge transfer [90]. The main idea of reusing pre-trained models for a new task that lacks much labeled data is to have a starting point instead of building the model from scratch and creating large amounts of labeled data, which is very expensive [91]. Transfer learning is very popular in the DL field because of its advantages, including better performance and considerable time savings during training, which can lead to rapid progress [92]. Many pre-trained CNN models were trained on the ImageNet image database [93]; the ones used in the current study are VGG16, VGG19, ResNet50, ResNet101, MobileNet, MobileNetV2, MobileNetV3Small, and MobileNetV3Large.

VGG is one of the popular pre-trained models used for image classification because of its simplicity. There are several versions of the VGG architecture, published by Oxford University researchers [94]. Despite the model's simplicity, it is very expensive in terms of computation and memory. The current study uses two variants of VGG, distinguished by their number of layers: VGG16 and VGG19. VGG has a competitive advantage over other models in that it uses only \(3 \times 3\) convolution filters, and it achieved a 9.9% top-five error on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [95]. ResNet is a pre-trained deep residual network model proposed by the Microsoft Research team in “Deep Residual Learning for Image Recognition” [96]. ResNet is a very deep model that is easy to optimize and can increase the accuracy with the depth of the network. It uses forward and backward propagation techniques and the ReLU activation function [97], and achieved a 7.8% top-five error on ILSVRC.

MobileNet is one of the more recently proposed pre-trained models, with many modifications and advantages over previous models. It was proposed by the Google Research team [98, 99] and is suitable for mobile and embedded applications. It consists of blocks of 3 layers each, including a residual block with a stride of one and blocks with a stride of two, and utilizes depth-wise separable convolution modules [100]. It can handle many tasks at the same time and has the smallest memory footprint compared to the other models, so it is simple, without the complexity or the large number of parameters that can affect the overall performance. The model achieved a 0.901 top-five accuracy on ILSVRC [101]. Table 3 compares the discussed and used pre-trained CNN models.

Table 3 Comparing the used pre-trained CNN models
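As a hedged illustration of the transfer-learning idea (not the exact architecture tuned in this study), a pre-trained ImageNet backbone such as VGG16 can be frozen and topped with a small classification head in Keras; the input shape and five-class head are assumptions taken from the rest of the paper.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# ImageNet weights are reused as a fixed feature extractor; only the head is trained
base = VGG16(weights="imagenet", include_top=False, input_shape=(100, 100, 3))
base.trainable = False

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dropout(0.5),
    layers.Dense(5, activation="softmax"),  # five heart-sound classes assumed
])
```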

3.3.2 Parameters optimization

Parameter optimization is the process of updating the parameters (i.e., weights) of the model to obtain more accurate results and reduce the losses [103]. The parameter optimizers used in the current study are Adam [104], NAdam [105], AdaGrad [106], AdaDelta [107], AdaMax [108], RMSProp [109], SGD [110], Ftrl [111], SGD Nesterov [112], RMSProp Centered [113], and Adam AMSGrad [114].

3.3.3 Hyperparameters

The loss function has a critical role in evaluating the proposed solution and calculating the model errors [115]. It determines how good the model is, so that the parameters can be changed to improve the model performance and minimize the overall loss. It can be viewed as a penalty for failing to reach the desired output: if the deviation of the value predicted by the model from the desired value is large, the function gives a high loss value, and a smaller one otherwise [116]. The losses used in the current study are Categorical Crossentropy [117], Categorical Hinge [118], KLDivergence [119], Poisson [120], Squared Hinge [121], and Hinge [122].

Batch size is the number of data records used to train the model in every iteration; it influences the model generalization, the parameter values, and the convergence of the loss function, and plays an important role in making the learning process quicker and more stable [123]. Dropout is a regularization technique applied during training to any or all hidden layers of the CNN architecture. It plays an important role in preventing and addressing overfitting to keep performance at an optimal level, and it can improve the generalization ability by randomly setting the output of a given neuron to 0 [124].
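The sketch below shows where the optimizer, loss function, batch size, and dropout appear in a Keras training setup; the dummy tensors and the specific choices (Adam, categorical cross-entropy, batch size 32) are illustrative assumptions rather than the settings selected by the optimization process described later.

```python
import numpy as np
import tensorflow as tf

# Dummy tensors standing in for pre-processed feature images and one-hot labels
x_train = np.random.rand(64, 100, 100, 3).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 5, 64), 5)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(100, 100, 3)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.5),                    # dropout regularization
    tf.keras.layers.Dense(5, activation="softmax"),
])

# The optimizer and loss function are the tuned hyperparameters; batch size
# controls how many records feed each parameter update
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=["accuracy"])
model.fit(x_train, y_train, batch_size=32, epochs=2, verbose=0)
```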

3.4 Metaheuristic optimization using aquila optimizer (AO)

Metaheuristic optimization is one of the most popular choices for modeling and solving complex optimization problems that are difficult to solve in traditional ways. “Meta” in metaheuristic refers to a higher level that performs better than simple heuristics; it relies on a tradeoff between global exploration and local search [125]. Metaheuristic algorithms have two essential parts: diversification and intensification. Diversification generates various solutions for exploring the search space, whereas intensification focuses the search on a local region by exploiting information where a good solution has been found. Metaheuristic optimization is utilized to find the optimal solution for many challenging optimization problems, provided that the right algorithm is chosen for the case at hand [126].

The Aquila Optimizer (AO) is a novel metaheuristic optimization method. The optimization process of the AO algorithm is performed in four ways: (1) choosing the search space through a high soar with a vertical stoop (Eq. 4), (2) exploring within the chosen search space through contour flight with a short glide attack (Eq. 5), (3) exploiting the converged search space through low flight with a descent attack (Eq. 6), and (4) swooping by walking and grabbing the prey (Eq. 7).

$$\begin{aligned} X\left( t+1\right)= & {} X_\mathrm{best}\left( t\right) \times \left( 1-\frac{t}{T}\right) \nonumber \\&+ \left( \frac{\sum _{i=1}^{N}{X\left( t\right) }}{N} - X_\mathrm{best}\left( t\right) \times \hbox {rand} \right) \end{aligned}$$
(4)
$$\begin{aligned} X\left( t+1\right)= & {} X_\mathrm{best}\left( t\right) \times \text {Levy}\left( D\right) + X_R\left( t\right) + \left( r_1 + U \times D_1\right) \nonumber \\&\times \left( \cos \left( -\omega \times D_1 + 1.5 \times \pi \right) - \sin \left( -\omega \times D_1\right. \right. \nonumber \\&\left. \left. + 1.5 \times \pi \right) \right) \times \hbox {rand} \end{aligned}$$
(5)
$$\begin{aligned} X\left( t+1\right)= & {} \left( X_\mathrm{best}\left( t\right) - \frac{\sum _{i=1}^{N}{X\left( t\right) }}{N}\right) \times \alpha - \hbox {rand} \nonumber \\&+ \left( \left( UB - LB\right) \times \hbox {rand} + LB\right) \times \sigma \end{aligned}$$
(6)
$$\begin{aligned} X\left( t+1\right)= & {} QF \times X_\mathrm{best}\left( t\right) - X\left( t\right) \times \hbox {rand} \times \left( 2 \times \hbox {rand} - 1\right) \nonumber \\&- \text {Levy}\left( D\right) \times 2 \times \left( 1 - \frac{t}{T}\right) + \hbox {rand} \times \left( 2 \times \hbox {rand} - 1\right) \end{aligned}$$
(7)

where \(X(t+1)\) is the solution of the next iteration, N is the population size, t is the iteration number, T is the total number of iterations, rand is a random number in the range [0, 1], \(X_R(t)\) is a random solution in the current iteration t, \(X_\mathrm{best}(t)\) is the best solution in the current iteration t, D is the dimension space size, \(\text {Levy}(D)\) is the levy flight distribution function, \(r_1\) is a value in the range [1, 20], U equals 0.00565, \(D_1\) is a value in the range [1, D], QF is the quality function, \(\alpha \) (and \(\sigma \)) equal to 0.1, UB is the upper bound, and LB is the lower bound. The fixed values are taken from the original AO paper.

The optimization procedure in AO starts by generating a random predefined set of candidate solutions, called the population; through repeated iterations, the AO search strategies explore the position of the best (or a near-optimal) solution, and every solution updates its position depending on the best solution found so far [127]. A series of experiments was conducted to validate the optimizer’s ability to find the best solution for various optimization tasks. AO performance can be further enhanced by combining it with flight, mutation, levy, stochastic (and evolutionary) components and with global (or local) search [128].
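The following is a minimal, simplified sketch of the AO update rules in Eqs. 4-7 for a generic minimization problem; it is not the implementation used in this study. The constant values follow the original AO paper, while the scalar treatment of \(D_1\) and the 50/50 choice between the paired strategies are simplifying assumptions.

```python
import math
import numpy as np

def levy(dim, beta=1.5, rng=None):
    """Levy-flight step vector (Mantegna's algorithm)."""
    rng = rng or np.random.default_rng()
    sigma = (math.gamma(1 + beta) * math.sin(math.pi * beta / 2) /
             (math.gamma((1 + beta) / 2) * beta * 2 ** ((beta - 1) / 2))) ** (1 / beta)
    u, v = rng.normal(0, sigma, dim), rng.normal(0, 1, dim)
    return 0.01 * u / np.abs(v) ** (1 / beta)

def aquila_optimizer(fitness, dim, lb, ub, pop_size=10, max_iter=15, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(lb, ub, size=(pop_size, dim))          # random initial population
    fit = np.array([fitness(x) for x in X])
    best, best_fit = X[fit.argmin()].copy(), fit.min()
    alpha = delta = 0.1                                    # fixed AO constants
    U, omega, r1 = 0.00565, 0.005, 10
    for t in range(1, max_iter + 1):
        QF = t ** ((2 * rng.random() - 1) / (1 - max_iter) ** 2)   # quality function
        mean_X = X.mean(axis=0)
        for i in range(pop_size):
            rand = rng.random()
            if t <= (2 / 3) * max_iter:                    # exploration stage
                if rng.random() < 0.5:                     # Eq. 4: high soar
                    new = best * (1 - t / max_iter) + (mean_X - best * rand)
                else:                                      # Eq. 5: contour flight
                    D1 = rng.integers(1, dim + 1)
                    theta = -omega * D1 + 1.5 * math.pi
                    new = (best * levy(dim, rng=rng) + X[rng.integers(pop_size)]
                           + (r1 + U * D1) * (math.cos(theta) - math.sin(theta)) * rand)
            else:                                          # exploitation stage
                if rng.random() < 0.5:                     # Eq. 6: low flight
                    new = ((best - mean_X) * alpha - rand
                           + ((ub - lb) * rand + lb) * delta)
                else:                                      # Eq. 7: swooping
                    new = (QF * best - X[i] * rand * (2 * rand - 1)
                           - levy(dim, rng=rng) * 2 * (1 - t / max_iter)
                           + rand * (2 * rand - 1))
            new = np.clip(new, lb, ub)
            f_new = fitness(new)
            if f_new < fit[i]:                             # keep improvements only
                X[i], fit[i] = new, f_new
                if f_new < best_fit:
                    best, best_fit = new.copy(), f_new
    return best, best_fit

# Example: minimize the sphere function in 5 dimensions
print(aquila_optimizer(lambda x: float(np.sum(x ** 2)), dim=5, lb=-10, ub=10))
```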

3.5 Image augmentation

Data Augmentation (DA) is the process of enlarging a dataset into a larger, richer, and more diverse one [129]. DA can increase the performance of the CNN model through better generalization and greater data variety, enabling the model to detect or classify objects in an image in different orientations and dimensions [130]. This process is a pre-processing step, as it is applied only to the training subset of the dataset to increase its size and variation. DA can be performed using different transformation techniques, including (1) flipping the image, (2) zooming it in or out, (3) rotating the image by a specific degree, (4) shifting the image, (5) cropping the image, (6) changing the brightness of the image, and (7) shearing the image horizontally (or vertically) [131].

Flipping is done by flipping the image vertically (or horizontally), depending on the object’s location in the image. Rotation is done by rotating the image by a specific degree. Shearing is done by shifting one part of the image. Cropping is applied by removing columns (or rows) of pixels from the image so that the object appears in different locations. Shifting is done by moving the pixels along the width (or height) of the image in one direction, vertically or horizontally, without affecting the dimensions of the image. Brightness changing makes the image lighter (or darker) to enable the model to recognize different lighting levels. Zooming is done by zooming the image in (or out) within a specific range, and it can be applied to each axis of the image independently [132].
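A sketch of such an augmentation pipeline using Keras' ImageDataGenerator is shown below; the specific ranges are illustrative assumptions rather than the values tuned in this study, and cropping is omitted since this particular API does not provide it.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=20,             # rotation by up to +/- 20 degrees
    width_shift_range=0.1,         # horizontal shifting
    height_shift_range=0.1,        # vertical shifting
    shear_range=0.1,               # shearing
    zoom_range=0.2,                # zooming in or out
    brightness_range=(0.8, 1.2),   # brightness changes
    horizontal_flip=True,          # flipping
    vertical_flip=True,
)
# augmenter.flow(...) or augmenter.flow_from_directory(...) then streams
# augmented batches for the training subset only
```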

3.6 Data normalization (DN)

Data Normalization (DN) is one of the pre-processing techniques that map attribute values to a known range or scale to improve the performance of ML algorithms. There are different DN techniques, but the ones used in the current study are (1) Standard Scaler, (2) Min-Max Scaler, (3) Max-Abs Scaler, and (4) Normalization [133].

3.6.1 Standard scaler

It is one of the DN techniques applied to the feature vectors. It standardizes the features by removing the mean (making it equal to zero) and scaling each feature to unit variance. Equation 8 shows how to calculate it.

$$ {\text{out}} = \frac{{{\text{in-mean}}}}{{{\text{std}}}} $$
(8)

where out is the scaled value, in is the input value, mean is the mean, and std is the standard deviation.

3.6.2 Min-max scaler

It transforms the dataset values into the range between 0 and 1, where the smallest value is mapped to 0 and the largest value is mapped to 1. Equation 9 shows how to calculate it.

$${\text{out}} = \frac{{{\text{in}}- {\text{in}}_{{{\text{min}}}} }}{{{\text{in}}_{{{\text{max}}}} - {\text{in}}_{{{\text{min}}}} }}$$
(9)

where \(in_{max}\) is the maximum value and \(in_{min}\) is the minimum value.

3.6.3 Max-abs scaler

It is similar to the min-max scaler except that it maps the values into the range between \(-1\) and 1 by dividing each feature by its maximum absolute value; it scales the data but does not shift (center) it. The maximum absolute value of any feature therefore equals 1. Equation 10 shows how to calculate it.

$$ {\text{out}} = \frac{{{\text{in}}}}{{{\text{|in}}_{{\max }} |}} $$
(10)

3.6.4 Normalization

It is deployed by squeezing the data between 0 and 1. It is very useful in classification and for data containing negative values [134]. Equation 11 shows how to calculate it.

$$ {\text{out}} = \frac{{{\text{in}}}}{{{\text{in}}_{{{\text{max}}}} }} $$
(11)
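For illustration, the first three techniques map directly to scikit-learn scalers, and the last can be approximated with a row-wise max Normalizer; the toy matrix below is only for demonstration and the row-wise behaviour of Normalizer is an approximation of Eq. 11, not an exact match.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler, Normalizer

X = np.array([[1.0, -5.0], [3.0, 0.0], [5.0, 10.0]])       # toy feature matrix

print(StandardScaler().fit_transform(X))    # Eq. (8): zero mean, unit variance per column
print(MinMaxScaler().fit_transform(X))      # Eq. (9): each column mapped to [0, 1]
print(MaxAbsScaler().fit_transform(X))      # Eq. (10): divide by the max absolute value
print(Normalizer(norm="max").fit_transform(X))  # close to Eq. (11), but applied per row
```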

4 Suggested approach

The current study suggests a framework for heart disease learning and optimization. It is divided into four major phases (or layers): (1) dataset collection, (2) pre-processing (segmentation and feature extraction), (3) learning and hyperparameter optimization, and (4) export and statistics. The framework flow is summarized in Fig. 6.

Fig. 6
figure 6

The 4-Phases Suggested Framework Flow Summarization

In summary, the input layer accepts the voice records. These records flow sequentially to the pre-processing phase, which is partitioned into two sub-phases; its target is to segment the records into sub-records of equal duration and to extract features from them numerically and graphically. These features and graphs are the inputs of the third phase, whose role is to learn and optimize the selected model. The optimized model is exported in the last phase, together with the training, validation, and testing statistics and figures. The phases are discussed in the following subsections.

4.1 Dataset collection phase

The Classifying Heart Sounds Challenge dataset [135] is used in the current study. It comprises two challenges, which the authors combined into one dataset. It has five classes (i.e., categories): (1) Murmur, (2) Normal, (3) Artifact, (4) Extra Heart Sound, and (5) Extrasystole. The data consist of voice records with the extensions “wav”, “aif”, and “aiff”. Table 4 shows the categories and the corresponding number of records.

Table 4 The used dataset classes and the corresponding number of records

4.2 Pre-processing phase

The dataset is pre-processed in two sub-phases. The first sub-phase segments the records into sub-records of equal duration. The second sub-phase extracts the features, numerically for the used ML techniques and graphically for the pre-trained CNN models.

4.2.1 Voice segmentation

The records should be segmented into a fixed time duration such as 1 s or 3 s. The approach suggested in the current study is to segment the records with different time durations in both directions and concatenate them. How does this happen? For each record in the dataset, a specified time window moves from the beginning of the record and splits it into sub-records. For example, if the record’s duration is 9 s and the allowed time window is 1 s, then 9 sub-records are generated; if the allowed time window is 2 s, then 4 sub-records are generated. What about the remaining small time segment? It is ignored. In the last example, there are only 4 generated sub-records, and hence the remaining 1 s is neglected as it is shorter than the allowed time window (i.e., 2 s). How is the number of segments obtained? Eq. 12 shows how to calculate the number of segments for a record.

$$\begin{aligned} \hbox {no}_\mathrm{segments}=\Bigl \lfloor \left( \frac{\mathrm{duration}_\mathrm{record}}{\mathrm{duration}_\mathrm{window}}\right) \Bigr \rfloor \end{aligned}$$
(12)

Algorithm 1 shows the followed pseudocode function during the segmentation process for a single record.

figure a

What are the suggested time durations for segmentation? It is worth mentioning that the current study suggests segmenting each record with 1 s, 3 s, 5 s, 7 s, and 9 s time durations in both directions, and the whole set of segmented records is also concatenated into a single dataset. How is the segmentation done in both directions? For each sub-record, the voice is reversed, so two sub-records are generated from a single one. Figure 7 summarizes this process. The overall number of generated datasets is 11 (i.e., 2 for each time duration and 1 concatenated).
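A minimal sketch of this segmentation step is given below; the use of librosa for loading and the exact function signature are assumptions rather than the authors' code.

```python
import librosa

def segment_record(path, window_seconds):
    """Split one record into equal-length sub-records (Eq. 12); the leftover tail is dropped.
    Each sub-record is also reversed so that both directions are generated."""
    y, sr = librosa.load(path, sr=None)
    win = int(window_seconds * sr)
    n_segments = len(y) // win           # floor(duration_record / duration_window)
    segments = []
    for i in range(n_segments):
        sub = y[i * win:(i + 1) * win]
        segments.append(sub)             # forward direction
        segments.append(sub[::-1])       # reverse direction
    return segments

# e.g., sub-records for the five suggested windows of one record
# all_segments = {w: segment_record("record.wav", w) for w in (1, 3, 5, 7, 9)}
```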

Fig. 7
figure 7

Example on the segmentation process on a single record for a specific allowed time duration

4.2.2 Numerical features pre-processing

The numerical features are extracted from each segment of each record. The numerical voice feature extraction techniques used in the current study are (1) MFCC (HTK-style and Slaney-style), (2) Mel-Spectrogram, (3) ZCR, (4) RMSE, (5) spectral-based (spectral centroid, spectral bandwidth, spectral contrast, spectral flatness, and roll-off frequency), (6) Tonnetz (normal and harmonic), and (7) chroma-based (chroma-only, STFT, CQT, and CENS) techniques. Table 5 shows the used feature techniques and the corresponding number of extracted numerical features.

Table 5 The used feature techniques and the corresponding number of extracted numerical features

The segmentation process, as mentioned, is applied to each record for the time windows 1, 3, 5, 7, and 9 s in both directions (i.e., forward and reverse), and the whole set of segmented records is concatenated into one dataset, so the number of generated numerical datasets is 11. Algorithm 2 shows the pseudocode function followed during the numerical feature extraction process for all records.

figure b

Table 6 shows the generated numerical datasets and the corresponding number of records for each.

Table 6 The generated numerical datasets with the corresponding number of records

4.2.3 Graphical features pre-processing

The graphical features are extracted as images from each segment (i.e., sub-record) of each record, similarly to the numerical ones. The graphical voice feature extraction techniques used in the current study are (1) MFCC (HTK-style and Slaney-style), (2) Mel-Spectrogram, (3) Spectrogram, and (4) STFT. Table 7 shows the number of generated images for each category; the number of generated images is the same for each technique. Figure 8 shows samples from each category for each technique.

Table 7 The generated images for each class
Fig. 8
figure 8

Samples from the extracted images for each technique and class

4.3 Learning and optimization phase

Algorithm 3 shows the pseudocode of the learning and optimization processes using the pre-trained CNN models and ML algorithms. It accepts three inputs: (1) the selected model, (2) the dataset, and (3) the experimental configurations (from Table 9, which is discussed in the experiments section). Internally, it (1) splits the dataset into training, testing, and validation subsets, (2) checks whether the model is an ML algorithm or not, (3) if the model is an ML algorithm, applies the grid search optimization algorithm to find the best combination that leads the ML model to the top-1 performance metrics, and (4) if the model is a pre-trained CNN model, applies the AO metaheuristic optimizer to find the best solution that leads the CNN model to the top-1 performance metrics.

figure c
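A hedged sketch of this dispatch logic is shown below for the ML branch; the function name, the 85/15 split, and the use of scikit-learn's GridSearchCV with 5-fold cross-validation mirror the description above, while the AO branch is only indicated.

```python
from sklearn.base import BaseEstimator
from sklearn.model_selection import GridSearchCV, train_test_split

def learn_and_optimize(model, X, y, param_grid=None):
    """Sketch of Algorithm 3: grid search for ML models, AO for pre-trained CNNs."""
    X_trainval, X_test, y_trainval, y_test = train_test_split(
        X, y, test_size=0.15, shuffle=True)            # 85% train+validation / 15% test
    if isinstance(model, BaseEstimator):               # ML branch
        search = GridSearchCV(model, param_grid, cv=5, scoring="accuracy")
        search.fit(X_trainval, y_trainval)
        return search.best_estimator_, search.best_estimator_.score(X_test, y_test)
    # CNN branch: the AO metaheuristic (Sect. 3.4) searches the training
    # hyperparameters instead of a grid; omitted here for brevity
    raise NotImplementedError("CNN + AO branch not sketched")
```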

4.4 Export and statistics phase

In the current phase, the optimized model is exported for future or production use. Different statistics are calculated, such as accuracy, precision, and F1-score, and learning curves and figures are generated and stored. The current study calculates different state-of-the-art performance metrics: accuracy, F1-score, recall, specificity, area under the curve (AUC), sensitivity, intersection over union (IoU), Dice coefficient, and precision. They are summarized in Table 8.

Table 8 Summarization of the Performance Metrics

5 Experiments and discussions

The experiments are divided into two categories: (1) experiments related to the extracted numerical features using the ML algorithms and (2) experiments related to the images and extracted graphs using the pre-trained CNN models.

5.1 Experiments configurations

Python is the programming language used in the current study. The learning and optimization environments are Google Colab (with its GPU) and a Toshiba Qosmio X70-A with 32 GB RAM and an Intel Core i7 processor. TensorFlow, Keras, NumPy, OpenCV, Pandas, and Matplotlib are the used Python packages [136]. The dataset split ratio is set to 85% (training and validation) and 15% (testing), and dataset shuffling is applied. The images are resized to (100, 100, 3) in RGB. Table 9 summarizes the configurations of the experiments.

Table 9 The used experiments configurations

5.2 ML experiments

The current subsection presents and discusses the experiments related to the 321 extracted numerical features using the mentioned ML algorithms (i.e., DT, AB, RFC, ETC, and KNN). For each ML algorithm, 11 experiments are applied on the 1, 3, 5, 7, 9 s, and mixed durations in the forward and reverse directions. The algorithms are optimized using grid search with 5 cross-validation folds to find the combinations with the highest metrics. The metrics (i.e., accuracy, precision, recall, and F1-score) are captured and reported. It is worth mentioning that the word “files” refers to the 1, 3, 5, 7, 9 s, and mixed durations shown in Table 6.

5.2.1 K-nearest neighbor experiment

Table 10 summarizes the reported results of the KNN experiment, sorted in descending order by accuracy. It shows that the “Ball Tree” algorithm and “Distance” weights are the best among the other variations. The “Max-Abs” scaler is the best in 7 files, while the “0” variance threshold is the best in 8 files. The maximum reported accuracy, precision, recall, and F1-score are all 100%. The 9 s segmentation duration is the best in both directions, while the concatenated dataset reported only 99.97%. Figure 9 shows the accuracy, precision, recall, and F1-score curves for the different files.

Table 10 Summarization of the reported results of the KNN experiment
Fig. 9
figure 9

The accuracy, precision, recall, and F1-score KNN curves of the different files

5.2.2 Decision tree (DT) experiment

Table 11 summarizes the reported results of the DT experiment, sorted in descending order by accuracy. It shows that the “Best” splitter and the “Entropy” criterion are the best among the other variations. The “Normalize” scaler is the best in 4 files, while the “0.001” variance threshold is the best in 4 files. The maximum reported accuracy, precision, recall, and F1-score are all 99.89%. The concatenated dataset is the best among the datasets. Figure 10 shows the accuracy, precision, recall, and F1-score curves for the different files.

Table 11 Summarization of the reported results of the DT experiment
Fig. 10
figure 10

The Accuracy, precision, recall, and F1-score DT curves of the different files

5.2.3 AdaBoost (AB) experiment

Table 12 summarizes the reported results of the AB experiment, sorted in descending order by accuracy. It shows that “50” estimators is the best among the other variations. The “Normalize” scaler is the best in 7 files, while the “0.01” variance threshold is the best in 5 files. The maximum reported accuracy, precision, recall, and F1-score are all 62.61%. The 9 s segmentation duration in the forward direction is the best, while the concatenated dataset reported only 60.47%. Figure 11 shows the accuracy, precision, recall, and F1-score curves for the different files.

Table 12 Summarization of the reported results of the AB experiment
Fig. 11
figure 11

The accuracy, precision, recall, and F1-score AB curves of the different files

5.2.4 Random forest classifier (RFC) experiment

Table 13 summarizes the reported results of the RFC experiment, sorted in descending order by accuracy. It shows that the “Entropy” criterion and “50” estimators are the best among the other variations. The “Max-Abs” scaler is the best in 4 files, while the “0.005” variance threshold is the best in 6 files. The maximum reported accuracy, precision, recall, and F1-score are all 100%. The 3N, 9N, and concatenated files are the best. Figure 12 shows the accuracy, precision, recall, and F1-score curves for the different files.

Table 13 Summarization of the reported results of the RFC experiment
Fig. 12
figure 12

The accuracy, precision, recall, and F1-score RFC curves of the different files

5.2.5 Extra trees classifier (ETC) experiment

Table 14 summarizes the reported results of the ETC experiment, sorted in descending order by accuracy. It shows that the “Entropy” criterion and “100” estimators are the best among the other variations. The “Max-Abs” scaler is the best in 5 files, while the “0.01” variance threshold is the best in 4 files. The maximum reported accuracy, precision, recall, and F1-score are all 100%. The 5N, 9N, and concatenated files are the best. Figure 13 shows the accuracy, precision, recall, and F1-score curves for the different files.

Table 14 Summarization of the reported results of the ETC experiment
Fig. 13
figure 13

The accuracy, precision, recall, and F1-score ETC curves of the different files

5.2.6 ML experiments summarization

Table 15 summarizes the best-reported results of the ML numerical experiments with respect to the top-1 accuracy, and Table 16 summarizes the best-reported results with respect to the concatenated dataset. Figure 14 compares the two tables (i.e., the top-1 and concatenated accuracies) and shows that the concatenated dataset performs better than the other datasets. The current study therefore recommends concatenating the records segmented with different time durations in both directions.

Table 15 Summarization of the reported results of All Experiment concerning the Top-1 accuracy
Table 16 Summarization of the reported results of all experiment concerning the concatenated dataset
Fig. 14
figure 14

Comparison between the Top-1 and concatenated accuracies

5.3 CNN experiments

The current subsection presents and discusses the experiments on the images and the extracted graphical features using the mentioned pre-trained CNN models (i.e., VGG16, VGG19, ResNet50, ResNet101, MobileNet, MobileNetV2, MobileNetV3Small, and MobileNetV3Large) and the AO meta-heuristic optimizer. The number of epochs is set to 5. The number of AO iterations and the population size are set to 15 and 10, respectively; hence, 150 records are reported. The captured metrics are the loss, accuracy, F1-score, recall, specificity, AUC, IoU coefficient, Dice coefficient, and precision, as mentioned in the experiments’ configurations subsection [137].
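
To make this setup concrete, the following is a minimal Keras sketch of transfer learning with one of the listed backbones (VGG16) and a subset of the captured metrics; the head architecture, input size, class count, and loss are assumptions, the IoU and Dice coefficients would require custom metric functions (not shown), and the AO search around these hyperparameters is likewise omitted.

# Hypothetical transfer-learning sketch with a frozen VGG16 backbone; the head,
# input size, class count, and metric subset are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 2              # assumed: diseased vs. healthy
INPUT_SHAPE = (224, 224, 3)  # assumed input size for the graphical features

base = tf.keras.applications.VGG16(weights="imagenet", include_top=False, input_shape=INPUT_SHAPE)
base.trainable = False       # transfer learning: keep the pre-trained weights fixed

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",  # assumed; the study tunes such choices with AO
    metrics=["accuracy",
             tf.keras.metrics.Precision(name="precision"),
             tf.keras.metrics.Recall(name="recall"),
             tf.keras.metrics.AUC(name="auc")],
)
# model.fit(train_ds, validation_data=val_ds, epochs=5)  # 5 epochs, as in the experiments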

5.3.1 MFCC using Slaney experiment

Table 17 summarizes the reported results of the MFCC using Slaney experiment, sorted vertically in descending order by accuracy. It shows that the VGG16 model reports the highest accuracy, 99.17%. Figure 15 shows the accuracy, F1-score, recall, specificity, AUC, sensitivity, IoU, Dice, and precision curves of the different pre-trained CNN models.

Table 17 Summarization of the reported results of the MFCC using Slaney experiment
Fig. 15 The MFCC using Slaney curves of the different pre-trained CNN models
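
For the graphical features themselves, a minimal librosa sketch of rendering an MFCC image with the Slaney-style mel filterbank (librosa's default, htk=False) could look like the following; the file path, figure size, and number of coefficients are placeholder assumptions.

# Hypothetical sketch of rendering MFCCs with the Slaney mel filterbank as an image.
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("heart_sound.wav", sr=None)               # placeholder record
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, htk=False)  # htk=False -> Slaney-style filterbank

plt.figure(figsize=(4, 4))
librosa.display.specshow(mfcc, sr=sr, x_axis="time")
plt.axis("off")
plt.savefig("mfcc_slaney.png", bbox_inches="tight", pad_inches=0)
plt.close()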

5.3.2 MFCC using HTK experiment

Table 18 summarizes the reported results of the MFCC using HTK experiment, sorted vertically in descending order by accuracy. It shows that the ResNet50 model reports the highest accuracy, 98.25%. Figure 16 shows the accuracy, F1-score, recall, specificity, AUC, sensitivity, IoU, Dice, and precision curves of the different pre-trained CNN models.

Table 18 Summarization of the reported results of the MFCC using HTK experiment
Fig. 16 The MFCC using HTK curves of the different pre-trained CNN models
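
Under the same placeholder assumptions as the previous sketch, switching to the HTK-style filterbank would only change the htk flag.

# Hypothetical HTK variant of the previous MFCC sketch; only the filterbank flag changes.
import librosa

y, sr = librosa.load("heart_sound.wav", sr=None)                  # placeholder record
mfcc_htk = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, htk=True)  # htk=True -> HTK-style filterbank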

5.3.3 STFT experiment

Table 19 summarizes the reported results of the STFT experiment, sorted vertically in descending order by accuracy. It shows that the VGG19 model reports the highest accuracy, 98.78%. Figure 17 shows the accuracy, F1-score, recall, specificity, AUC, sensitivity, IoU, Dice, and precision curves of the different pre-trained CNN models.

Table 19 Summarization of the reported results of the STFT experiment
Fig. 17 The STFT curves of the different pre-trained CNN models
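
A comparable hedged sketch for the STFT images, assuming librosa's default FFT parameters and a dB-scaled magnitude, might be the following.

# Hypothetical sketch of rendering a dB-scaled STFT magnitude as an image.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("heart_sound.wav", sr=None)  # placeholder record
stft_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)

plt.figure(figsize=(4, 4))
librosa.display.specshow(stft_db, sr=sr, x_axis="time", y_axis="hz")
plt.axis("off")
plt.savefig("stft.png", bbox_inches="tight", pad_inches=0)
plt.close()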

5.3.4 Mel-specgram experiment

Table 20 summarizes the reported results of the Mel-Specgram experiment, sorted vertically in descending order by accuracy. It shows that the ResNet50 model reports the highest accuracy, 98.68%. Figure 18 shows the accuracy, F1-score, recall, specificity, AUC, sensitivity, IoU, Dice, and precision curves of the different pre-trained CNN models.

Table 20 Summarization of the reported results of the mel-specgram experiment
Fig. 18 The mel-specgram curves of the different pre-trained CNN models
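
The Mel-Specgram images could be produced in a similar fashion with librosa's mel spectrogram; the number of mel bands and the dB scaling are assumptions.

# Hypothetical sketch of rendering a dB-scaled mel spectrogram as an image.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

y, sr = librosa.load("heart_sound.wav", sr=None)              # placeholder record
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # 128 mel bands assumed
mel_db = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(4, 4))
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.axis("off")
plt.savefig("mel_specgram.png", bbox_inches="tight", pad_inches=0)
plt.close()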

5.3.5 Specgram experiment

Table 21 summarizes the reported results of the Specgram experiment, sorted vertically in descending order by accuracy. It shows that the ResNet50 model reports the highest accuracy, 99.00%. Figure 19 shows the accuracy, F1-score, recall, specificity, AUC, sensitivity, IoU, Dice, and precision curves of the different pre-trained CNN models.

Table 21 Summarization of the reported results of the specgram experiment
Fig. 19 The specgram curves of the different pre-trained CNN models
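
Finally, a plain spectrogram (specgram) image can be obtained, for instance, with matplotlib's built-in specgram; this is an assumed implementation, not necessarily the one used in the study.

# Hypothetical sketch of rendering a plain spectrogram image with matplotlib.
import librosa
import matplotlib.pyplot as plt

y, sr = librosa.load("heart_sound.wav", sr=None)  # placeholder record

plt.figure(figsize=(4, 4))
plt.specgram(y, Fs=sr)                            # matplotlib's built-in spectrogram
plt.axis("off")
plt.savefig("specgram.png", bbox_inches="tight", pad_inches=0)
plt.close()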

5.3.6 CNN experiments summarization

Table 22 summarizes the best-reported results of the performed CNN experiments. The best overall accuracy among the applied CNN experiments is 99.17%, reported by VGG16 in the MFCC using Slaney experiment, and the average accuracy is 98.78%. Applying augmentation and the “Poisson” loss function is recommended by 3 of the experiments.

Table 22 Summarization of the reported results of all experiments
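
Since augmentation and the “Poisson” loss appear among the recommended settings, the following is a minimal hedged Keras sketch of how they could be combined with one of the listed backbones; the specific augmentation layers, their parameters, and the backbone choice are assumptions.

# Hypothetical sketch of on-the-fly image augmentation plus the Poisson loss in Keras.
import tensorflow as tf

augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),  # augmentation operations are illustrative
    tf.keras.layers.RandomRotation(0.1),
    tf.keras.layers.RandomZoom(0.1),
])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    augmentation,  # applied on the fly during training only
    tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(2, activation="softmax"),  # assumed: diseased vs. healthy
])

model.compile(optimizer="adam", loss=tf.keras.losses.Poisson(), metrics=["accuracy"])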

5.4 Error analysis

The authors investigated the reasons behind the misclassification rates in the reported results. These reasons can be: (1) the size of the dataset is not large enough, (2) the dataset is imbalanced, as shown in Table 4, (3) there is a degree of similarity between multiple rows after applying the segmentation, and (4) the complexity of some models is not sufficient for generalization.

5.5 Related studies comparisons

Table 23 shows a comparison between the suggested approach and related studies concerning the same datasets.

Table 23 Comparison between the suggested approach and related studies

6 Study limitations

The results of the suggested framework are encouraging, but there are still some limitations. First, only voice records are used. Second, only eight transfer learning CNN models are selected among the available models. Third, the study does not include the use of Long Short-Term Memory networks (or their variations) for the frequency-based data. Fourth, the current study does not utilize graph neural networks, which can be used in future studies [146]. However, the results of the current study are promising, and the proposed framework can be applied in hospitals.

7 Conclusions and future work

With the application of artificial intelligence in medical diagnosis, the detection of diseases has become more accurate. In this work, a framework for the detection of one of the widely spread diseases (i.e., cardiovascular diseases) is proposed. The reason behind this choice is the high morbidity and mortality rates due to these diseases. The hybrid framework uses medical voice records for the detection of heart diseases. The layers of the suggested framework are the Segmentation Layer, the Features Extraction Layer, the Learning and Optimization Layer, and the Export and Statistics Layer. The Segmentation Layer is the layer in which the different records are segmented with specific durations; a novel segmentation technique using variable durations in the forward and backward directions is proposed. In the Features Extraction Layer, numerical and graphical features are extracted from the resulting datasets. These features are passed to the Learning and Optimization Layer, where the numerical features are passed to 5 different machine learning (ML) algorithms with the Grid Search optimization algorithm, while the graphical features are passed to 8 different convolutional neural networks (CNN) with the Aquila Optimizer (AO) using transfer learning. Different performance metrics are used in the Export and Statistics Layer to validate the performance of the proposed framework. The best-reported metrics are 100% accuracy, precision, recall, and F1-score using ML algorithms such as ETC and RFC. Also, the proposed approach achieves 99.17% accuracy using CNN.

7.1 Future work

In future work, the authors will apply the suggested approach to different dataset types, such as waves. Also, the datasets can be handled and utilized using different optimization methods. Finally, graph neural networks and LSTM networks can be utilized.