1 Introduction

1.1 Motivation

In December 2019, in the city of Wuhan, China, the most critical outbreak in the last hundred years began: coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2, a virus of zoonotic origin that was unknown until then [15, 21, 23, 52, 64]. Compared with its predecessors (SARS-CoV and MERS-CoV), SARS-CoV-2 proved to be much more resistant and infectious. The most common symptoms are fever, dry cough, and tiredness [15, 23, 32, 52, 64]. Pain and discomfort, sore throat, diarrhea, conjunctivitis, headache, loss of taste or smell, skin rash, and discoloration of the fingers or toes may also appear [12, 15, 46, 52, 60, 64]. Its severe symptoms are difficulty breathing or shortness of breath, pain or pressure in the chest, and loss of speech or movement [12, 15, 23, 32, 46, 52, 60, 64]. Despite its lower lethality, the virus spreads very quickly, causing a large number of deaths and leaving sequelae that are often permanent [12, 23, 46, 60]. Because of its high rate of contagion, public health system resources are rapidly depleted [23]. The COVID-19 pandemic is perhaps the biggest health crisis in decades.

Accurate diagnosis plays an important role against COVID-19. The test established as the gold standard for the diagnosis of COVID-19 is the reverse transcription polymerase chain reaction (RT-PCR), which searches for the SARS-CoV-2 virus, whose RNA is reverse-transcribed into DNA, in samples of saliva and other human secretions [27]. However, the RT-PCR process can take hours or even days, considering both the volume of tests required and the logistics of transporting the samples, given the pandemic situation [27]. Late diagnosis can result in late patient care, which can make recovery difficult. In addition, infected people who are not isolated can spread the virus.

There are also some rapid tests based on the identification of serological evidence of the presence of the virus, such as antibodies or antigens. However, these tests are nonspecific because they do not detect the presence of the virus directly. Therefore, their performance depends on other factors, such as the time since the onset of the disease and the viral concentration in the sample of interest [40, 62]. Another complementary method in the process of diagnosing COVID-19 is the computerized tomography (CT) X-ray scan [2, 3, 44, 54]. Combined with RT-PCR, CT has great clinical value, since CT images make it possible to analyze precisely the effects of COVID-19, such as bilateral pulmonary parenchymal ground-glass and consolidative pulmonary opacities [39]. As a disadvantage, CT is an expensive exam that requires a dedicated room that is difficult to isolate, which makes it a risk factor for contamination.

Several studies have sought to highlight the nature of COVID-19 as a disease that mainly affects the cardiovascular system [13, 30, 31, 42, 57, 65]. Coronaviruses, such as SARS-CoV and SARS-CoV-2, have the angiotensin-converting zinc metallopeptidase 2 (ACE2), an enzyme present in the cell membranes of the arteries, heart, lungs, and other organs as a functional receptor. ACE2 is involved in cardiac function, hypertension, and diabetes [59]. The MERS-CoV and SARS-CoV coronaviruses can cause acute myocarditis and heart failure [65]. Some of the impacts of coronaviruses on the cardiovascular system are increased blood pressure and increased levels of troponin I (hs-cTnI) [65]. COVID-19 patients may also develop lymphopenia, i.e., low level of lymphocytes [30, 42, 57] and leukopenia, i.e., few leukocytes. COVID-19 patients may also experience decreased hemoglobin levels, absolute lymphocyte count (ALC), and absolute monocyte count (AMC) [30]. Patients who have developed severe forms of the disease have significantly higher levels of Interleukin-6 and D-dimer than patients who have developed a moderate form of COVID-19 [31]. Therefore, considering that COVID-19 is a disease that affects blood parameters, hematological tests can be used to help diagnose the disease. According to Mele et al. [43], if myocarditis is suspected in a patient with COVID-19 due to acute-onset cardiac symptoms or ECG changes, cardiac troponin and bedside echocardiography should be obtained. The same is true for patients who develop electrical or hemodynamic instability [28, 43]. Special attention should be given to changes or trends in biomarkers and not to values obtained in isolation [28, 43]. The main differential diagnoses are stress-induced cardiomyopathy, sepsis-related cardiomyopathy, and acute coronary syndrome, especially for the fulminant form of myocarditis [43].

Computational intelligence, especially machine learning techniques, has been indicated for several clinical tasks involving biomedical image classification [7, 17, 20, 21, 22, 24, 25, 26, 41, 47, 48, 49, 51, 53, 55]. Such techniques could provide a secure and semiautomatic way to diagnose COVID-19 in CT X-ray images and in hematological parameters [21, 33]. Therefore, they are promising for identifying the disease in ECG traces.

Training deep artificial neural networks on large image databases can involve a high computational effort, in terms of both memory and processing. This task usually requires expensive computational architectures, with a lot of data memory and parallelism resources, such as graphics processing units (GPUs), and can demand days or even months of processing, depending on the size of the database under study. Deep convolutional neural networks (deep CNNs) are often quite suitable for solving image pattern recognition problems. However, these neural networks have many tuning parameters, such as the number of neurons per layer, the number of layers, the learning rate, the maximum number of iterations, and the weights of each convolution neuron. Several CNN architectures have been proposed, such as LeNet and ResNet, for which it is only necessary to adjust the weights of the neurons. However, this is still a computationally expensive task, as it is necessary to adjust the weights of tens or even hundreds of thousands of neurons.

Transfer learning consists of using classifiers trained on databases of objects of interest different from those used in the application: the output of the pretrained classifier is considered a representation of the input object according to a certain universe of representation. In this way, pre-trained classifiers are used as feature extractors. Thus, shallow classifiers, which can range from shallow neural networks to support vector machines, decision trees, statistical classifiers, or regressors, can be used for the final classification. Consequently, hybrid architectures based on deep transfer learning and shallow classifiers reduce the computational complexity associated with training on large image databases, minimize the problem of tuning training parameters, and combine well-established state-of-the-art learning models. Some works have explored the use of CNNs to support the clinical diagnosis of COVID-19 by analyzing ECG signals represented as images [4, 5, 6, 45, 50]. In this work, we investigated hybrid models based on deep transfer learning with CNNs and Random Forests, seeking to minimize the training complexity by reducing the number of adjustable parameters. This is a desirable feature in clinical diagnosis support systems: it is important that learning machines can be retrained periodically, with their models refined as new data become available [9, 21].

Considering that COVID-19 is essentially a disease of the cardiovascular system, severely affecting the blood, it is to be expected that this disease affects cardiac function and, therefore, can be visualized in the expression of electrocardiographic (ECG) signals. Considering a database in which samples of electrocardiographic signals from healthy individuals and those with relatively common heart diseases are present, in addition to ECG signals from individuals with COVID-19, this work aims to investigate how COVID-19 differs in relation to heart diseases considering cardiac function and whether this differentiation is sufficient to support the clinical diagnosis of COVID-19. In this work, we propose the use of hybrid deep neural network architectures, based on pretrained deep networks in a transfer learning approach, for feature extraction, and output layers with Random Forests, for effective classification, in order to support the clinical diagnosis of COVID-19 and cardiac diseases on ECG signals. As inputs, we used images of printed traces of ECG signals obtained in a clinical environment.

1.2 Related works

Several studies emphasize the importance of hematological parameters to support COVID-19 clinical diagnosis. Some of them point to the relevance of hematological analysis as an indicator of COVID-19 severity. Fan et al. [30] analyzed hematological parameters of 69 patients with COVID-19. A total of 13.4% of patients needed intensive care unit (ICU) care, especially the elderly. They found that monitoring these parameters can help to identify patients who will need ICU assistance. Tan et al. [57] also assessed the complete blood count of patients. They used data from both cured patients and patients who died from COVID-19. Among their findings, there are some key indicators of disease progression. Therefore, monitoring these parameters may support future clinical management decisions. In this context, both Soares et al. [56] and de Freitas Barbosa et al. [21] used methods based on artificial intelligence to identify COVID-19 through blood tests. They achieved very high sensitivity and specificity in this diagnosis.

Angeli et al. [1] examined 50 patients admitted to hospital with proven COVID-19 pneumonia. All patients underwent a detailed clinical examination, 12-lead ECG, laboratory tests, and arterial blood gas test. ECG was also recorded at discharge and in case of worsening clinical conditions. Mean age of patients was 64 years and 72% were men. At baseline, 30% of patients had ST-T abnormalities, and 33% had left ventricular hypertrophy. During hospitalization, 26% of patients developed new ECG abnormalities which included atrial fibrillation, ST-T changes, tachy-brady syndrome, and changes consistent with acute pericarditis. Patients free of ECG changes during hospitalization were more likely to be treated with antiretrovirals (68% vs 15%, p = 0.001) and hydroxychloroquine (89% vs 62%, p = 0.026) versus those who developed ECG abnormalities after admission. In addition, the majority (54%) of patients with ECG abnormalities had 2 prior consecutive negative nasopharyngeal swabs. ECG abnormalities were first detected after an average of about 30 days from symptom onset (range 12–51 days).

Bergamaschi et al. [10] evaluated 269 consecutive patients admitted with confirmed SARS-CoV-2 infection. ECGs available at admission and after 1 week from hospitalization were assessed. The authors evaluated the correlation between ECG findings and major adverse events (MAE), defined as the composite of intra-hospital all-cause mortality or need for invasive mechanical ventilation. ECGs were defined as abnormal if primary ST-T segment alterations, left ventricular hypertrophy, tachy- or bradyarrhythmias, or any new AV block, bundle branch block, or significant morphology alteration (e.g., new pathological Q waves) were present.

Abnormal ECG at admission (106/216) and elevated baseline troponin values were more common in patients who developed MAE (p = 0.04 and p = 0.02, in this order). Concerning ECGs recorded after 7 days (159), abnormal findings were reported in 53.5% of patients and they were more frequent in those with MAE (p = 0.001). Among abnormal ECGs, ischemic alterations and left ventricular hypertrophy were significantly associated with a higher MAE rate. The multivariable analysis showed that the presence of abnormal ECG at 7 days of hospitalization was an independent predictor of MAE (HR 3.2; 95% CI 1.2–8.7; p = 0.02). Patients with abnormal ECG at 7 days more often required transfer to the intensive care unit (p = 0.01) or renal replacement therapy (p = 0.04).

He et al. [34] reported two COVID-19 cases that exhibited different ECG manifestations as the COVID-19 caused deterioration. The first case presented temporary SIQIIITIII morphology followed by reversible nearly complete atrioventricular block, and the second demonstrated ST-segment elevation accompanied by multifocal ventricular tachycardia. According to the authors, the underlying mechanisms of these ECG abnormalities in the severe stage of COVID-19 may be attributed to hypoxia and inflammatory damage incurred by the virus.

COVID-19 and other cardiovascular diseases (CVDs) were detected using deep-learning techniques by Rahman et al. [50]. A public dataset of 1937 ECG images from five distinct categories, namely normal, COVID-19, myocardial infarction (MI), abnormal heartbeat (AHB), and recovered myocardial infarction (RMI), was used. Six different deep CNN models (ResNet18, ResNet50, ResNet101, InceptionV3, DenseNet201, and MobileNetV2) were used to investigate three different classification schemes: (i) two-class classification (normal vs COVID-19); (ii) three-class classification (normal, COVID-19, and other CVDs); and finally, (iii) five-class classification (normal, COVID-19, MI, AHB, and RMI). For two-class and three-class classification, DenseNet201 outperformed the other networks with accuracies of 99.1% and 97.36%, respectively, while for the five-class classification, InceptionV3 outperformed the others with an accuracy of 97.83%. ScoreCAM visualization confirmed that the networks were learning from the relevant area of the trace images.

Attallah [4] proposed a pipeline composed of five deep learning models called ECG-BiCoNet. Features mined from higher layers were fused using discrete wavelet transform and then integrated with lower-layers features. Afterward, a feature selection approach was utilized. Finally, an ensemble classification system was built to merge predictions of three machine learning classifiers. ECG-BiCoNet accomplishes two classification categories: binary and multiclass. The results of ECG-BiCoNet present a COVID-19 performance with an accuracy of 98.8% and 91.73% for binary and multiclass classification categories.

Ozdemir et al. [45] proposed to automatically diagnose COVID-19 by using hexaxial feature mapping to represent 12-lead ECGs as 2D color images. The gray-level co-occurrence matrix (GLCM) method was used to extract features and generate hexaxial mapping images. These generated images were then fed into a new convolutional neural network (CNN) architecture. Two different classification scenarios were conducted on a publicly available ECG image dataset. In the first scenario, ECG data labeled as COVID-19 and no-findings (normal) were classified. The proposed approach reached an accuracy of 96.20% and an F1-score of 96.30%. In the second scenario, ECG data labeled as negative (normal, abnormal, and myocardial infarction) and positive (COVID-19) were classified to evaluate COVID-19 diagnostic ability. The experimental results presented an accuracy of 93.00% and an F1-score of 93.20%.

Attia et al. [6] trained a CNN with 26,153 ECGs, of which 33.2% were from COVID-19-positive patients. They acquired ECGs both before and after diagnosis. A third ECG was recorded 14 days after the PCR result. After training, the CNN model was validated with 3826 ECGs. The test was performed with 7870 ECGs not included in the other sets. The AUC for detection of COVID-19 in the test group was 0.767, with a sensitivity of 98% and a specificity of 10%. Table 1 summarizes the state-of-the-art results discussed here in comparison with our approach.

Table 1 Summary of related works

2 Methods

2.1 Proposed method

In this study, we detail the development of the machine learning analysis step. Our method proposes and assesses the performance of several deep transfer learning hybrid architectures that perform both feature extraction and classification of the input ECG images (Fig. 1). These architectures use pre-trained deep networks for feature extraction: LeNet, ResNet, and VGG16. We adopted the 5-layer version of LeNet, pretrained with the MNIST dataset, which furnishes 500 features [8, 36, 37, 38]. We employed the ResNet-50 version of ResNet [38, 58, 61]. Both ResNet and VGG16 were pretrained with the ImageNet dataset, furnishing 2048 and 4096 features, respectively [38, 58]. We used the following configuration for all deep CNNs: β1 mean decay = 0.9, learning rate = 0.001, β2 var decay = 0.999, ϵ = 1.0E-8, bias initialization = 0.0, no weight noise, no dropout, batch size = 128, weight initialization method = XAVIER, updater = Adam, bias updater = Sgd, gradient normalization threshold = 1.0, and number of epochs = 10. The optimizer was the stochastic gradient descent algorithm. In the output layer, we tested several Random Forest configurations to perform classification.

Fig. 1

Proposed method: The main idea is to propose a model of a support system for the diagnosis of COVID-19 based on electrocardiography signals and machine learning. The symptomatic patient must go to a health center, where an ECG test should be ordered. Next, the physician will be able to photograph the resulting ECG signals with a cell phone. A mobile application will be able to analyze the signals with machine learning. The results we present in this work are related to the backend of this application. First, the application will pre-process the image, standardizing its background. Then, two trained classifiers will extract attributes and classify the images in two ways: in 3 classes (COVID-19, healthy, and other heart diseases) and in 5 classes (COVID-19, healthy, myocardial infarction, history of MI, and abnormal heartbeat). Both reports will be available to assist the medical team in deciding the appropriate clinical management for the patient.

In order to optimize our method, we investigated two approaches: A and B. In the first one, we assessed the classification performance using the entire set of features. In approach B, we added a feature selection step prior to classification. We selected the features using a Particle Swarm Optimization (PSO) algorithm with 20 individuals and 20 iterations.
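
As an illustration of the feature selection step, the following is a minimal sketch of a generic binary PSO with 20 particles and 20 iterations, where each particle is a 0/1 mask over the features. It is not necessarily the PSO variant implemented in Weka. The inertia and acceleration coefficients (`w`, `c1`, `c2`), the sigmoid update rule, and the toy fitness function are assumptions made only for this example; in practice, the fitness would be a measure of how well the selected subset supports classification.

```python
import math
import random


def binary_pso(n_features, fitness, n_particles=20, n_iter=20, seed=0,
               w=0.7, c1=1.5, c2=1.5):
    """Generic binary PSO: each particle is a 0/1 mask over the features."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # best mask seen by each particle
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]  # best mask seen by the swarm
    history = [gbest_fit]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # The sigmoid of the velocity is the probability of keeping feature d.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
        history.append(gbest_fit)
    return gbest, gbest_fit, history


# Hypothetical toy fitness: reward five "informative" features and
# penalize the size of the selected subset.
INFORMATIVE = {0, 3, 7, 12, 18}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, best_fit, history = binary_pso(30, toy_fitness)
print(sum(best_mask), "features kept; fitness =", round(best_fit, 2))
```

The returned mask plays the role of the reduced feature set passed to the classifier in approach B.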

In addition, two different scenarios were studied for each approach: 3 classes (COVID-19, healthy, and other heart diseases), and 5 classes (COVID-19, healthy heartbeat (normal), abnormal heartbeat (AHB), myocardial infarction (MI), and history of MI). In the case of three classes, the class “other heart diseases” comprises all images of patients with MI, history of MI, and AHB. Overall, we evaluated the performances of the different hybrid architectures using the following scenarios:

  (i) Approach A with 3 classes

  (ii) Approach A with 5 classes

  (iii) Approach B with 3 classes

  (iv) Approach B with 5 classes
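
The relation between the 5-class and 3-class label sets, and the four resulting scenarios, can be sketched as follows. The image counts are those reported in the dataset description; the dictionary keys and the helper `merge_counts` are hypothetical names introduced only for this illustration.

```python
from collections import Counter
from itertools import product

# Image counts per class, as reported in the dataset description.
FIVE_CLASS_COUNTS = {
    "COVID-19": 250,
    "normal": 859,
    "AHB": 548,
    "MI": 77,
    "history of MI": 203,
}

# Mapping from the 5-class labels to the 3-class labels
# (the label strings are illustrative).
TO_THREE_CLASSES = {
    "COVID-19": "COVID-19",
    "normal": "healthy",
    "AHB": "other heart diseases",
    "MI": "other heart diseases",
    "history of MI": "other heart diseases",
}


def merge_counts(counts, mapping):
    """Add up the counts of all classes mapped to the same merged label."""
    merged = Counter()
    for label, n in counts.items():
        merged[mapping[label]] += n
    return dict(merged)


three_class_counts = merge_counts(FIVE_CLASS_COUNTS, TO_THREE_CLASSES)

# Scenarios (i)-(iv): every combination of approach and number of classes.
scenarios = list(product(("A", "B"), (3, 5)))

print(three_class_counts)
print(scenarios)
```

Under this mapping, the merged class "other heart diseases" contains 548 + 77 + 203 = 828 images, which is close in size to the healthy class (859 images), while COVID-19 remains the minority class (250 images).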

Figure 2 illustrates this method, and each step is further detailed in the following sections. It is worth mentioning that, in all scenarios, we used 75% of the instances for training and testing, and 25% for validation. These sets were split randomly by the Weka software, considering the signals of all patients in the database. In training and testing, we used 10-fold cross-validation in 30 rounds. The cross-validation subsets were also split randomly by the software. In addition, we performed a class balancing step on the training and testing dataset, using the Synthetic Minority Oversampling Technique (SMOTE) with 3 nearest neighbors [11, 14]. Figure 2 also presents the number of features selected by the PSO method for each dataset. For the 3-class problem, the number of features decreased by 77.6% for LeNet, 82.6% for ResNet, and 80.3% for VGG16. For the 5-class problem, the reduction was 73.2% for LeNet, 78.6% for ResNet, and 81.1% for VGG16. All these steps were performed using the Waikato Environment for Knowledge Analysis (Weka), version 3.8 [63].
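
To make the class balancing step concrete, the sketch below shows the core idea of SMOTE with 3 nearest neighbors: each synthetic sample is a random point on the segment between a minority-class sample and one of its nearest minority-class neighbors. This is a simplified illustration, not the Weka filter itself: base samples are drawn at random instead of being visited in turn, the function name `smote` and the toy 2D points are hypothetical, and the number of synthetic samples generated per class is not stated in the text.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic samples by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbors of `base` among the other minority samples.
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy minority class in a 2D feature space (illustrative values only).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
new_points = smote(minority, n_new=10, k=3)
print(new_points[:3])
```

Because every synthetic point lies between two real minority samples, oversampling does not create feature values outside the range already observed for that class.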

Fig. 2

Description of the training of the two proposed classifiers: First, the database was pre-processed in order to standardize the background of the images. Next, we did an attribute extraction by testing three different deep architectures: LeNet, ResNet, and VGG16. The networks resulted in 501, 2049, and 4096 attributes, respectively. The third step consisted of applying two approaches: approach A, where no attribute selection was applied, and approach B, where Particle Swarm Optimization was used for feature selection. Then, 75% of the data were used for training and testing with tenfold cross-validation. For this, the SMOTE method was used to balance classes and different Random Forests were tested for classification. Finally, 25% of the data were used to validate the classifiers. This methodology was applied to both classifications with 3 and 5 classes

Our interest was to use the ECG database to investigate the relationship between COVID-19 and changes in cardiac function as a way to support clinical diagnosis. By clinical diagnosis, we mean the process in which the clinician assesses which pathology is most likely by using nonspecific tests in a context that supports the diagnostic hypothesis to be validated. We understand that, as the COVID-19 pandemic ends and COVID-19 becomes an endemic disease, improving clinical diagnosis with computational intelligence tools can contribute to clinical decision-making.

2.2 Dataset

In this work, we adopted the dataset of ECG images of cardiac and COVID-19 patients created by Khan et al. [35]. This dataset contains 1937 distinct patient records. Data were collected using the ECG device “EDAN SERIES-3” installed in cardiac care and isolation units of different health care institutes across Pakistan. The collected ECG images were manually reviewed by medical professors using the Telehealth ECG diagnostic system, under the supervision of senior medical professionals with experience in ECG interpretation. Over several months, the specialist committee reviewed five distinct categories: COVID-19 (250 images), abnormal heartbeat (AHB) (548 images), myocardial infarction (MI) (77 images), previous history of MI (203 images), and healthy heartbeat (normal) (859 images). Therefore, the collected database has one record per patient and contains 12-lead ECG images. This dataset was designed to evaluate machine learning methods in studies focused on COVID-19, arrhythmia, and other cardiovascular conditions. The dataset contains rare categories of patients that are useful for the development of automatic diagnosis tools for healthcare institutes. Since the ECG trace images were taken at different times and places, we noticed a small variability in their background color. Thus, we converted the images to grayscale in order to minimize these differences (see Fig. 3). This step was called background standardization. Figure 4 shows samples of images from each class in the database.

Fig. 3

Samples of original images and their respective preprocessed images

Fig. 4

Image samples from the dataset

2.3 Metrics

To evaluate the classification results objectively, we used the following metrics: the κ index, the overall accuracy, the confusion matrix, the sensitivity, the specificity, and the AUC. The confusion matrix for a universe of classes of interest Ω = {C_1, C_2, …, C_m} is an m × m matrix T = [t_{i,j}], where each element t_{i,j} represents the number of objects belonging to class C_j but classified as C_i [16, 17, 18, 19, 20, 29, 41].

The overall accuracy is the probability that the experiment will provide correct results, that is, that it will correctly classify ECG images as COVID-19, normal, abnormal heartbeat, myocardial infarction, or history of MI. In other words, it is the proportion of true positives (TP) and true negatives (TN) among all the results, which also include false positives (FP) and false negatives (FN). The sensitivity is the true positive rate, while the specificity is the true negative rate. AUC stands for “area under the ROC curve.” The ROC curve, in turn, is a graph of the true positive rate versus the false positive rate. Finally, the kappa index measures the agreement between predicted and true classes, corrected for the agreement expected by chance [29]. Accuracy, sensitivity, specificity, and the kappa index can be calculated according to Eqs. 1, 2, 3, and 4, respectively.

$$\mathrm{Accuracy}={\rho }_{v}=\frac{TP+TN}{TP+TN+FP+FN},$$
(1)
$$\mathrm{Sensitivity}=\frac{TP}{TP+FN},$$
(2)
$$\mathrm{Specificity}=\frac{TN}{TN+FP},$$
(3)
$$\kappa =\frac{{\rho }_{v}-{\rho }_{z}}{1-{\rho }_{z}},$$
(4)

where

$${\rho }_{z}=\frac{{\sum }_{i=1}^{m}\left({\sum }_{j=1}^{m}{t}_{i,j}\right)\left({\sum }_{j=1}^{m}{t}_{j,i}\right)}{{\left({\sum }_{i=1}^{m}{\sum }_{j=1}^{m}{t}_{i,j}\right)}^{2}}.$$
(5)

ρ_v is the accuracy, ρ_z is the agreement expected by chance, and t_{i,j} is the element of the confusion matrix in position (i,j), i.e., the number of instances in the evaluated set belonging to the j-th class but classified as belonging to the i-th class by the machine learning model under evaluation, for 1 ≤ i,j ≤ m.
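
As a worked illustration of Eqs. 1 to 5, the sketch below computes the overall accuracy, the kappa index, and one-vs-rest sensitivity and specificity from a confusion matrix. It assumes that rows index the predicted class and columns index the true class; accuracy and kappa do not depend on this choice, but sensitivity and specificity do. The matrix values are hypothetical and are not taken from the figures of this work, and the text does not state how the per-class rates are aggregated into a single multiclass value.

```python
def overall_accuracy(t):
    """Eq. 1 for the multiclass case: diagonal sum over the total count."""
    total = sum(sum(row) for row in t)
    return sum(t[i][i] for i in range(len(t))) / total


def kappa(t):
    """Eqs. 4 and 5: agreement corrected for the agreement expected by chance."""
    m = len(t)
    total = sum(sum(row) for row in t)
    rho_v = overall_accuracy(t)
    rho_z = sum(
        sum(t[i][j] for j in range(m)) * sum(t[j][i] for j in range(m))
        for i in range(m)
    ) / total ** 2
    return (rho_v - rho_z) / (1 - rho_z)


def sensitivity_specificity(t, c):
    """Eqs. 2 and 3 for class index c against all other classes.
    Assumes rows = predicted class, columns = true class."""
    m = len(t)
    total = sum(sum(row) for row in t)
    tp = t[c][c]
    fp = sum(t[c][j] for j in range(m)) - tp  # predicted c, true class differs
    fn = sum(t[i][c] for i in range(m)) - tp  # true class c, predicted otherwise
    tn = total - tp - fp - fn
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical 3-class confusion matrix (illustrative values only).
T = [
    [50, 0, 0],
    [0, 90, 5],
    [0, 10, 45],
]

print(overall_accuracy(T))            # 185 correct out of 200 instances
print(kappa(T))
print(sensitivity_specificity(T, 1))
```

For this matrix, ρ_v = 185/200 = 0.925 and ρ_z = (50·50 + 95·100 + 55·50)/200² = 0.36875, so κ = 0.55625/0.63125 ≈ 0.881.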

3 Results

3.1 Approach A: without feature selection

3.1.1 The 3-class problem

Table 2 shows the results of mean and standard deviation for all studied deep architectures and classifiers. The average results obtained refer to the 30 runs made with each configuration using tenfold cross validation. The table presents the three best classifiers in blue, one from each pre-trained architecture. Moreover, the boxplots of Fig. 5 compare, in terms of kappa statistic, the performance of Random Forests with different number of trees using VGG16 architecture for feature extraction.
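
The evaluation protocol of 30 runs of tenfold cross-validation can be sketched as a generator of train/test index splits, followed by a mean and standard deviation over the collected scores. This is a plain, unstratified illustration with hypothetical function names; it does not reproduce how Weka builds its folds, and whether the reported standard deviations are taken over the 30 runs or over all individual folds is not stated in the text.

```python
import random
import statistics


def repeated_kfold(n_samples, n_folds=10, n_rounds=30, seed=0):
    """Yield (round, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(n_rounds):
        idx = list(range(n_samples))
        rng.shuffle(idx)  # a new random partition in every round
        folds = [idx[f::n_folds] for f in range(n_folds)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f
                         for i in fold]
            yield r, f, train_idx, test_idx


def summarize(scores):
    """Mean and sample standard deviation of a list of scores."""
    return statistics.mean(scores), statistics.stdev(scores)


splits = list(repeated_kfold(n_samples=100))
print(len(splits))  # 30 rounds x 10 folds

# Illustrative scores only; real scores would come from training and
# testing a classifier on each split.
mean_score, std_score = summarize([0.87, 0.85, 0.88])
```

Each round uses every instance exactly once for testing, so the 30 rounds differ only in how the instances are partitioned into folds.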

Table 2 Training performance for the 3-class problem using the entire set of features. There are the results of five metrics: accuracy, kappa, sensitivity, specificity, and AUC. We measured average and standard deviation values from 30 runs of training/testing with all algorithms
Fig. 5

Training performance for the 3-class problem using the entire set of features. Each box shows the kappa statistics achieved by the different Random Forest settings using VGG16 architecture for feature extraction

Numerical results are shown according to the following standardization: we adopted the standard precision of two decimal places for values obtained in a single round; additionally, we adopted a precision of four decimal places for experiments obtained with multiple repetitions, such as cross-validation experiments, since the standard deviation obtained was very low and it was only possible to perceive it with four decimal places.

Finally, after analyzing the different classification results, the best configuration (VGG16 for feature extraction and Random Forest with 70 trees) was applied to the validation set, which contains 25% of the data from the original database. These results are presented in Fig. 6 below.

Fig. 6

Validation performance for the 3-class problem using approach A

3.1.2 The 5-class problem

Table 3 shows the results of mean and standard deviation in the classification of ECG images, considering the scenario with 5 classes. In addition, Fig. 7 shows the training performance achieved by all Random Forest settings using VGG16 architecture for feature extraction. Ultimately, the validation results are presented in Fig. 8 below.

Table 3 Training performance for the 5-class problem using the entire set of features. There are the results of five metrics: accuracy, kappa, sensitivity, specificity, and AUC. We measured average and standard deviation values from 30 runs of training/testing with all algorithms
Fig. 7

Training performance for the 5-class problem using the entire set of features. Each box shows the kappa statistics achieved by the different Random Forest settings using VGG16 architecture for feature extraction

Fig. 8

Validation performance for the 5-class problem using approach A

3.2 Approach B: with feature selection

3.2.1 The 3-class problem

Table 4 shows the average results obtained in the 3-class classification using the attributes selected with PSO. The number of features for LeNet, ResNet, and VGG16 was reduced from 500, 2048, and 4096 to 112, 356, and 805, respectively. In addition, the boxplots of Fig. 9 compare the performance of Random Forests with different numbers of trees. In this case, feature extraction was performed by VGG16, the pre-trained architecture with superior results. However, the difference between the classifiers only becomes clear when we look at the average values in Table 4. Figure 10 presents the validation results for approach B in the 3-class scenario.
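
The feature counts above can be checked against the reduction percentages reported for the 3-class problem in the description of the method (77.6%, 82.6%, and 80.3%). The short computation below uses only the numbers given in the text; the variable names are illustrative.

```python
# Features extracted by each network and features kept by PSO (3-class problem).
extracted = {"LeNet": 500, "ResNet": 2048, "VGG16": 4096}
selected = {"LeNet": 112, "ResNet": 356, "VGG16": 805}

# Relative reduction in the number of features, in percent.
reduction = {
    name: 100.0 * (1 - selected[name] / extracted[name])
    for name in extracted
}

for name, pct in reduction.items():
    print(f"{name}: {pct:.1f}% fewer features")
```

Rounded to one decimal place, the results agree with the reported values.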

Table 4 Training performance for the 3-class problem using the features selected by PSO. There are the results of five metrics: accuracy, kappa, sensitivity, specificity, and AUC. We measured average and standard deviation values from 30 runs of training/testing with all algorithms
Fig. 9

Training performance for the 3-class problem using the features selected by PSO. Each box shows the kappa statistics achieved by the different Random Forest settings using VGG16 architecture for feature extraction

Fig. 10

Validation performance for the 3-class problem using approach B

3.2.2 The 5-class problem

Table 5 shows the results of the experiments with selected attributes for the 5-class problem. Figure 11 shows the boxplots for all ten Random Forest configurations after feature extraction with VGG16. The graph shows a slight increase in the kappa index as the number of trees increases.

Table 5 Training performance for the 5-class problem using the features selected by PSO. There are the results of five metrics: accuracy, kappa, sensitivity, specificity, and AUC. We measured average and standard deviation values from 30 runs of training/testing with all algorithms
Fig. 11

Training performance for the 5-class problem using the features selected by PSO. Each box shows the kappa statistics achieved by the different Random Forest settings using VGG16 architecture for feature extraction

Finally, Fig. 12 presents the validation performance using Random Forest with 100 trees after feature extraction with VGG16. This validation step was performed with 25% of the original dataset.

Fig. 12

Validation performance for the 5-class problem using approach B

4 Discussion

Table 2 presents the results of the investigation of the best hybrid architecture, that is, a pre-trained deep neural network for feature extraction and a Random Forest classifier in the decision-making stage, for classification with three classes in approach A. For this investigation, the training and test set balanced with the SMOTE oversampling technique is considered. Results are presented as mean and standard deviation of accuracy, kappa index, sensitivity, specificity, and area under the ROC curve (AUC). The results show that, for feature extraction with the LeNet and ResNet networks, the best classification model was the Random Forest with 100 trees: average accuracy of 81.53%, average kappa of 0.7229, and sensitivity, specificity, and AUC greater than 0.98 for LeNet; and mean accuracy of 85.10%, mean kappa of 0.7764, and sensitivity, specificity, and AUC greater than 0.99 for ResNet. All these results were obtained with low standard deviations, which shows considerable stability of performance: 2.29% for accuracy, 0.0344 for kappa, 0.0139 for sensitivity, 0.0041 for specificity, and 0.0003 for AUC, for the LeNet network; and 2.46% for accuracy, 0.0369 for kappa, 0.0127 for sensitivity, 0.0042 for specificity, and 0.0001 for AUC, for the ResNet network.

However, despite the good performance of these two hybrid architectures, the best performance for approach A with three classes was obtained with the VGG16 deep network and a 70-tree Random Forest. For this architecture, a mean accuracy of 87.20%, a mean kappa of 0.8080, and mean sensitivity, specificity, and AUC equal to or greater than 0.98 were obtained. The standard deviations were also considerably low, comparable to those obtained with the best models based on ResNet and LeNet: 2.23% for accuracy, 0.0334 for kappa, 0.0155 for sensitivity, 0.0013 for specificity, and 0.0001 for AUC. The boxplots in Fig. 5 illustrate the behavior of the hybrid architectures with VGG16 and Random Forest, showing that, in Random Forest configurations with 20 to 100 trees, the performance in terms of the kappa statistic is statistically similar, but that the model with 70 trees presents greater stability of performance due to its smaller sample variance. Despite this, it is worth considering the use of fewer trees in practical applications, as in the application proposed in Fig. 1. In this case, the negligible loss of classification performance can be offset by the lower computational cost.
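The trade-off between number of trees and performance described above can be illustrated with a small selection rule. The following is a minimal sketch, not the procedure used in this work: it assumes a hypothetical rule that picks the smallest forest whose mean kappa lies within one standard deviation of the best mean. Only the 70-tree entry (0.8080, 0.0334) is taken from the text; all other values, and the function name `smallest_comparable`, are made up for illustration.

```python
# Mean and standard deviation of kappa per number of trees.
# Only the 70-tree entry comes from the text; the other entries are
# hypothetical placeholders, NOT values from Table 2.
RESULTS = {
    10: (0.7600, 0.0400),
    20: (0.7900, 0.0360),
    50: (0.8020, 0.0340),
    70: (0.8080, 0.0334),
    100: (0.8070, 0.0350),
}


def smallest_comparable(results):
    """Return the smallest n_trees whose mean is within one standard
    deviation (of the best configuration) of the best mean.

    This rule is an assumption made for this sketch.
    """
    best_n = max(results, key=lambda n: results[n][0])
    best_mean, best_std = results[best_n]
    threshold = best_mean - best_std
    return min(n for n, (mean, _std) in results.items() if mean >= threshold)


chosen = smallest_comparable(RESULTS)
print("smallest comparable forest:", chosen, "trees")
```

With such a rule, a smaller forest is preferred whenever its average kappa is not clearly separated from that of the best configuration.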

Figure 6 shows the confusion matrix obtained for the validation set, for the 3-class problem, approach A. The results for this one-shot learning approach are considerably good: accuracy of 96.67%, kappa of 0.94, sensitivity and specificity of 0.97, and AUC of 0.99. The confusion matrix shows that there is no confusion between COVID-19 and other cardiac pathologies or healthy cases. However, there is a non-negligible confusion between other pathologies and healthy cases: 194 images of other pathologies were classified correctly, while 11 were classified as normal/healthy.
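The metrics quoted above can all be derived from a confusion matrix. The sketch below shows one way to compute accuracy, Cohen's kappa, and per-class sensitivity and specificity in plain Python. The matrix is hypothetical: only the "other" row (194 correct, 11 classified as normal) follows the counts in the text, while the COVID-19 and normal rows are assumed values chosen for illustration, so the output should not be read as a reproduction of Fig. 6.

```python
# Rows = true class, columns = predicted class.
# Only the "other" row follows the text; the rest is ASSUMED.
LABELS = ["COVID-19", "normal", "other"]
CM = [
    [62, 0, 0],     # COVID-19 (assumed count, no confusion)
    [0, 208, 5],    # normal   (assumed counts)
    [0, 11, 194],   # other    (counts quoted in the text)
]


def accuracy(cm):
    """Fraction of samples on the main diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def cohen_kappa(cm):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = sum(cm[i][i] for i in range(n)) / total
    row_tot = [sum(cm[i]) for i in range(n)]
    col_tot = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(row_tot[k] * col_tot[k] for k in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


def sensitivity_specificity(cm):
    """One-vs-rest (sensitivity, specificity) pair for each class."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    pairs = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[i][k] for i in range(n)) - tp
        tn = total - tp - fn - fp
        pairs.append((tp / (tp + fn), tn / (tn + fp)))
    return pairs


print("accuracy:", round(accuracy(CM), 4))
print("kappa:", round(cohen_kappa(CM), 4))
for label, (sens, spec) in zip(LABELS, sensitivity_specificity(CM)):
    print(label, "sensitivity:", round(sens, 4), "specificity:", round(spec, 4))
```

A class with no off-diagonal entries in its row or column, as assumed here for COVID-19, has sensitivity and specificity equal to 1.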

Table 3 presents the results of the search for the best hybrid architecture for the classification with five classes, approach A. Here, the version of the training and test set balanced with the SMOTE oversampling technique is considered. Results are presented as mean and standard deviation of accuracy, kappa index, sensitivity, specificity, and AUC. The results show that, for the three deep networks, the best classification model was the Random Forest with 100 trees, as highlighted in orange in the table: average accuracy of 89.37%, average kappa of 0.8671, and sensitivity, specificity, and AUC greater than 0.99 for LeNet; mean accuracy of 91.03%, mean kappa of 0.8879, and sensitivity, specificity, and AUC greater than 0.99 for ResNet; and mean accuracy of 91.36%, mean kappa of 0.8920, and sensitivity, specificity, and AUC equal to or greater than 0.99 for VGG16. All these results were obtained with low standard deviations, and therefore with high stability of performance: 1.62% for accuracy, 0.0203 for kappa, 0.0117 for sensitivity, 0.0023 for specificity, and approximately 0 (zero) for AUC, for LeNet (the ResNet network achieved similar results); and 1.58% for accuracy, 0.0197 for kappa, 0.0167 for sensitivity, 0.0009 for specificity, and practically 0 (zero) for AUC, for the VGG16 network.

The boxplots in Fig. 7 present the behavior of hybrid architectures with VGG16 and Random Forest, illustrating that, in Random Forest configurations with 30 to 100 trees, the performance in terms of kappa statistics is statistically similar. Despite the average results, we chose the Random Forest model with 90 trees to perform validation because it has exactly the same performance as the 100-tree model, but with fewer trees, which makes it computationally less expensive.

Figure 8 shows the confusion matrix obtained for the validation set, for the 5-class problem, approach A. The results for this one-shot learning approach are considerably good and superior to the average results obtained in the training process in terms of accuracy and kappa index: accuracy of 94.38%, kappa of 0.92, sensitivity of 0.94, specificity of 0.97, and AUC of 0.99. Furthermore, the confusion matrix shows that COVID-19 class could be clearly distinguished from the others, with all images being correctly classified. In contrast, AHB and MI classes had more confusion in classification, with 8 misclassifications in both cases. It is also interesting to note that the biggest confusions are usually with the “normal” class.

Table 4 presents the results of the investigation of the best hybrid architecture for classification with three classes, approach B. In this approach, the PSO algorithm was used with a decision tree as an objective function. For the LeNet neural network, the number of attributes was reduced from 500 to 112; for ResNet, from 2048 to 356; for VGG16, from 4096 to 805 attributes. Here, the balanced training and test set version with the SMOTE oversampling technique was considered. Results are presented as mean and standard deviation of accuracy, kappa index, sensitivity, specificity, and AUC. The results show that, for the LeNet and ResNet deep networks, the best classification model was the Random Forest with 100 trees: average accuracy of 81.04%, average kappa of 0.7155, and sensitivity, specificity, and AUC equal to or greater than 0.98 for LeNet; and mean accuracy of 85.22%, mean kappa of 0.7782, and sensitivity, specificity, and AUC equal to or greater than 0.98 for ResNet. The performance of the two models was quite stable, with standard deviations of 2.40% for accuracy, 0.0360 for kappa, 0.0180 for sensitivity, 0.0052 for specificity, and 0.0016 for AUC, for LeNet; and 2.20% for accuracy, 0.0330 for kappa, 0.0145 for sensitivity, 0.0050 for specificity, and 0.0003 for AUC, for ResNet.

Nevertheless, the best performing model was VGG16 with a Random Forest of 60 trees, which achieved an average accuracy of 87.06%, an average kappa of 0.8059, and sensitivity, specificity, and AUC equal to or greater than 0.99. The sample standard deviations were low, so the classification performance was highly stable: 2.38% for accuracy, 0.0357 for kappa, 0.0162 for sensitivity, 0.0010 for specificity, and 0.0001 for AUC. The boxplots in Fig. 9 show that, for hybrid architectures with VGG16 and Random Forest, in Random Forest configurations with 30 to 100 trees, the performance in terms of the kappa statistic is statistically similar. We therefore kept the model with 60 trees. It is interesting to note that the results were close to those obtained previously with all attributes, with slightly lower averages, but with overlapping intervals. Thus, it is possible to use a significantly smaller number of attributes and maintain a good classification performance.
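The remark about overlapping intervals can be checked directly from the reported means and standard deviations for VGG16 in the 3-class problem. The sketch below is only an informal illustration: it assumes that "interval" means mean plus or minus one standard deviation, which is an assumption of this example and not a formal statistical test.

```python
def interval(mean, std):
    """Return (mean - std, mean + std); the one-std width is an assumption."""
    return (mean - std, mean + std)


def intervals_overlap(a, b):
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# (mean, std) values quoted in the text for VGG16, 3-class problem.
approach_a = {"accuracy": (87.20, 2.23), "kappa": (0.8080, 0.0334)}  # all attributes
approach_b = {"accuracy": (87.06, 2.38), "kappa": (0.8059, 0.0357)}  # PSO-selected

overlap = {
    metric: intervals_overlap(interval(*approach_a[metric]),
                              interval(*approach_b[metric]))
    for metric in approach_a
}
print(overlap)
```

Both intervals overlap almost entirely, which is consistent with treating the two approaches as equivalent in practice.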

Figure 10 shows the confusion matrix obtained for the validation set, for the 3-class problem, approach B. The results for this one-shot learning approach show a high performance of model generalization: accuracy of 96.26%, kappa of 0.94, and sensitivity, specificity, and AUC of 1.00. The confusion matrix shows that there is no confusion between COVID-19 and other cardiac pathologies or healthy cases, but there is a non-negligible confusion between the other pathologies and healthy cases, as the “normal” class was classified 6 times as “other.”
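For readers unfamiliar with PSO-based feature selection as used in approach B, the following is a minimal sketch of a binary PSO over feature masks. It is not the implementation used in this work: the binary variant with a sigmoid transfer function, all hyperparameters, and the names `binary_pso`, `toy_fitness`, and `INFORMATIVE` are assumptions made for illustration. In approach B the objective is based on a decision tree; here a toy objective stands in for it so that the example runs with the standard library alone.

```python
import math
import random


def binary_pso(fitness, n_features, n_particles=20, n_iters=40,
               inertia=0.7, c1=1.5, c2=1.5, v_max=4.0, seed=0):
    """Maximise fitness(mask) over 0/1 feature masks (binary PSO sketch)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
           for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + cognitive + social terms.
                v = (inertia * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                v = max(-v_max, min(v_max, v))  # clamp the velocity
                vel[i][d] = v
                # Sigmoid maps the velocity to the probability of a 1 bit.
                prob_one = 1.0 / (1.0 + math.exp(-v))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Toy stand-in objective (hypothetical): reward three "informative"
# features and penalise the total number of selected features.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)


best_mask, best_fit = binary_pso(toy_fitness, n_features=12)
print("selected features:", [i for i, bit in enumerate(best_mask) if bit])
```

Replacing `toy_fitness` with a function that trains a decision tree on the masked features and returns its accuracy would bring the sketch closer to the setting described in the text.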

Table 5 presents the results of the investigation of the best hybrid architecture for the classification with five classes, approach B. In this approach, the PSO algorithm was used with a decision tree as an objective function. For the LeNet neural network, the number of attributes has been reduced from 500 to 134; for ResNet, from 2048 to 438; for VGG16, from 4096 to 773 attributes. Here, the balanced training and test set version with the SMOTE oversampling technique was considered. The results show that, for the three deep neural networks, the best classification model was the Random Forest with 100 trees: mean accuracy of 88.23%, mean kappa of 0.8529, and sensitivity, specificity, and AUC equal to or greater than 0.99 for LeNet; mean accuracy of 90.50%, mean kappa of 0.8812, and sensitivity, specificity, and AUC equal to or greater than 0.99 for ResNet; and mean accuracy of 91.13%, mean kappa of 0.8891, and sensitivity, specificity, and AUC equal to or greater than 0.99 for VGG16. The performance of the three models was quite stable, with standard deviations of 1.80% for accuracy, 0.0225 for kappa, 0.0136 for sensitivity, 0.0026 for specificity, and 0.0002 for AUC, for LeNet; 1.56% for accuracy, 0.0195 for kappa, 0.0079 for sensitivity, 0.0024 for specificity, and practically 0 (zero) for AUC, for ResNet; and 1.51% for accuracy, 0.0189 for kappa, 0.0137 for sensitivity, 0.0010 for specificity, and practically 0 (zero) for AUC, for VGG16. The boxplots in Fig. 11 show that, for hybrid architectures with VGG16 and Random Forest, in Random Forest configurations with 30 to 100 trees, the performance in terms of kappa index is similar. So, we kept choosing the model with 100 trees.

Figure 12 shows the confusion matrix obtained for the validation set, for the 5-class problem, approach B. The results for this one-shot learning approach show a high performance of model generalization: accuracy of 93.54%, kappa of 0.91, and sensitivity, specificity, and AUC of 1. The confusion matrix shows that there is no confusion between COVID-19 and other cardiac pathologies or healthy cases. The confusion between cardiac pathologies and healthy cases decreased with the discrimination of pathologies. Only a non-negligible confusion between AHB and healthy cases remained. This shows that abnormal heartbeats, although highly differentiable in the ECG trace of normal cases, can still maintain some confusion, albeit small.

The 3-class classification scenario models the COVID-19 screening step: ECG signals represented as images are classified as healthy (normal), COVID-19, or other heart disease. The results considering all the features (VGG16 with a Random Forest of 70 trees, approach A, cf. Figure 6) show that this approach is well suited to a screening tool, as it obtained an accuracy of 97%, a sensitivity of 97%, a specificity of 97%, and an area under the ROC curve of 99%. Looking at the confusion matrix, it is clear that the signals of COVID-19 patients are not confused with those of healthy individuals or of individuals with heart disease. However, there is a slight confusion between electrocardiographic signals from healthy individuals and individuals with heart disease: 5 healthy individuals were classified as having heart disease, while 11 patients with heart disease were classified as healthy. The results considering only the most relevant features, as selected by the PSO (VGG16 with a Random Forest of 60 trees, approach B, cf. Figure 10), are slightly higher in terms of sensitivity and specificity: the model obtained an accuracy of 96%, a sensitivity of 100%, a specificity of 100%, and an area under the ROC curve of 100%. Considering the confusion matrix, the behavior was similar: the signals of COVID-19 patients are not confused with those of healthy individuals or individuals with heart disease. The slight confusion between signals from healthy individuals and from individuals with heart disease remains: 6 normal subjects were classified as having heart disease, while 12 heart disease patients were classified as healthy. These results show that the 3-class models, in both approach A and approach B, are useful for the screening of COVID-19 from the cardiac activity expressed in the electrocardiography tracing. The approach B model, however, is recommended, as it uses fewer features (805 selected features against the original 4096), requiring less computational effort, which makes it more suitable for composing web services to support the screening of COVID-19.

The classification scenario with 5 classes models the support for the clinical diagnosis of COVID-19 and heart diseases from the ECG analysis. One of the most important motivations for this scenario is the need to differentiate the changes in cardiac function caused by the moderate and severe forms of COVID-19, often confused in clinical practice with opportunistic myocarditis, from those caused by cardiopathic conditions, such as abnormal heartbeat, myocardial infarction, myocardium, and history of myocardial infarction [28, 43]. The results for the best model of approach A (VGG16 with a Random Forest of 90 trees, cf. Figure 8) were quite reasonable: 94% accuracy, 94% sensitivity, 97% specificity, and 99% AUC. Considering the confusion matrix, there is no confusion between the ECG signals of patients with COVID-19 and those of healthy individuals or heart disease patients. There is, however, some confusion among the remaining classes: patients with a history of myocardial infarction were classified as healthy (3 subjects); patients with abnormal heartbeats were classified as myocardial infarction (1 patient) or as healthy (7 individuals); patients with myocardial infarction were classified as abnormal heartbeat (1 subject) or as healthy (7 subjects); and healthy subjects were classified as having a history of myocardial infarction (2 subjects) or abnormal heartbeats (5 subjects). The best model of approach B (VGG16 with a Random Forest of 100 trees, cf. Figure 12) had slightly better results: accuracy of 94%, and sensitivity, specificity, and AUC of 100%. The confusion matrix shows small confusions similar to those obtained by the best model of approach A, with the advantage of using only 774 features of the 4096 original features. These results show that the changes in cardiac function caused by the moderate and severe forms of COVID-19 are considerably different from those caused by the cardiopathies under study.

5 Conclusion

This research aimed to investigate the influence of COVID-19 on cardiac function expressed through electrocardiography signals, from a machine learning point of view. To this end, we considered a public database of photographic images of printed ECGs, obtained from records in the context of emergency care. We found that COVID-19 may affect the cardiovascular system and produce pattern changes in the ECG trace that can be identified by machine learning algorithms. This study provides a satisfactory method using hybrid architectures composed of pre-trained deep neural networks, which extract features from ECG images, and Random Forests in the output layer, for classification of the ECG tracing in two scenarios: COVID-19 against cardiopathies and signals from healthy individuals, i.e., a 3-class problem; and COVID-19 against abnormal heartbeat, myocardial infarction, history of myocardial infarction, and healthy heartbeat, i.e., a 5-class problem.

On choosing the best classifier architecture: in most cases, the classification with Random Forest with 100 trees presents a higher average result in the training and test set. However, considering the standard deviation, the results of the various classifiers investigated overlap. Therefore, for practical applications, it is interesting to consider an approach with fewer trees as a choice criterion, reducing the computational cost and memory occupied in the application, as we did in this work.

Considering the feature selection step: for all configurations of extracted attributes, the PSO selected fewer than half of the original attributes, while the results for the training/testing and validation steps remained equivalent, with good averages and low variances.
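The size of this reduction can be made concrete with a short calculation. The sketch below uses the attribute counts reported in the discussion of Tables 4 and 5 and computes the fraction of attributes retained after PSO for each network and problem; the dictionary and function names are chosen for this example only.

```python
# Attribute counts before and after PSO, as reported in the discussion
# of Tables 4 (3-class) and 5 (5-class).
ORIGINAL = {"LeNet": 500, "ResNet": 2048, "VGG16": 4096}
SELECTED = {
    "3-class": {"LeNet": 112, "ResNet": 356, "VGG16": 805},
    "5-class": {"LeNet": 134, "ResNet": 438, "VGG16": 773},
}


def retained_fraction(problem, network):
    """Fraction of the original attributes kept by the selection step."""
    return SELECTED[problem][network] / ORIGINAL[network]


for problem, counts in SELECTED.items():
    for network in counts:
        pct = 100 * retained_fraction(problem, network)
        print(f"{problem} {network}: {pct:.1f}% of attributes retained")
```

In every configuration the retained fraction is well below one half of the original attributes.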

Taking into account the difference between approaches with and without feature selection, i.e., approaches B and A, respectively: the performances were very similar and can be considered statistically equivalent, even with a significant reduction in the number of attributes, which shows that attribute selection was an important step to reduce the computational complexity without significant loss of classification performance. This is important for the implementation of the solution, especially if a client–server architecture is chosen, because if the server-side solution has low processing and memory requirements, the service will be more available to support the diagnosis of COVID-19 and heart diseases by electrocardiography imaging.

In both approaches (A and B), both with 3 (screening support) and 5 (clinical diagnosis support) classes, there was no confusion in the classification of the COVID-19 class. When looking at the confusion matrices, all COVID-19 images were classified as COVID-19. However, the experimental results showed that the 5-class approach was able to achieve better classification performance than the 3-class approach. Our hypothesis is that, in the problem with 3 classes, the grouping of pathologies other than COVID-19 contributed to the worsening of the classification, due to the fact that these pathologies, i.e., abnormal heartbeat (AHB), myocardial infarction (MI), and history of MI, are quite different from each other from a physiological point of view and, therefore, in the ECG trace.

Finally, this work showed that the influence of COVID-19 on cardiac function expressed in the electrocardiographic signal is quite considerable and distinct from changes in cardiac function caused by other heart diseases: the detection of the COVID-19 class in a classification problem with deep learning showed no confusion with any heart disease, nor with signs of healthy individuals. Thus, this research showed that the scenarios of cardiomyopathy and thromboembolism caused by COVID-19 present distinct changes in cardiac function, expressed in electrocardiography. However, the database used in this research does not present information regarding the days after infection by COVID-19. It is only reported that ECG signals are captured from patients with moderate or severe COVID-19, admitted to a semi-intensive or intensive care unit. This research also showed that it is possible to build a solution to support both the screening and the clinical diagnosis of COVID-19 in the context of emergency care from a non-invasive and technologically scalable solution, based on machine learning. This solution allows both the support of screening and clinical diagnosis and the evaluation of the treatment of patients with COVID-19 or with different heart diseases, not requiring in-depth knowledge of the interpretation of the ECG regarding the changes in cardiac function caused by COVID-19.