1 Introduction

Coronavirus disease 2019 (COVID-19), caused by SARS-CoV-2, is a new disease that causes a range of illnesses in humans [1, 2]. Symptoms include fever, tiredness, breathing difficulties, dry cough, and acute respiratory distress syndrome (ARDS). In more serious cases, the disease can lead to death. As of February 2023, more than 755 million positive cases had been reported, with more than 6.8 million deaths [3]. Although most COVID-19 infections were asymptomatic or mild, a significant number of cases required intensive care, and many were fatal [4]. Severe COVID-19 has been associated with several factors such as age, gender, race, and various comorbidities [5].

To diagnose patients efficiently and provide appropriate treatment, patient data is needed [6, 7], because it provides essential information about each patient's parameter values [8]. This information is crucial in helping specialists to speed up diagnosis and provide rapid, accurate treatment [7]. Patient data can also assist in monitoring the progression of a disease over time, which supports data-driven treatment decisions and allows the treatment to be fine-tuned accordingly. By analyzing patient data, authorities can identify common patterns among patients and take the necessary actions to minimize further spread of the disease [7]. More importantly, patient data is critical in medical research, where it is considered the main factor in the development of treatments and cures.

Rumaling et al. [9] introduced a new method for detecting coronavirus using Raman spectroscopy. The proposed approach uses the unique “biofingerprint” characteristics displayed by the virus to distinguish it from other pathogens. The study used a total of 150 samples divided into two parts. The first part consists of 75 nasal swab samples obtained from individuals who had contracted SARS-CoV-2 (positive cases), and the second part consists of the remaining 75 nasal swab samples from healthy individuals. By analyzing the Raman spectra of virus samples, scientists can identify specific molecular vibrations that serve as distinctive markers for coronavirus. The study emphasized the benefits of Raman spectroscopy, including its quick analysis, non-destructive nature, and minimal sample preparation requirements. The researchers conducted experiments with different coronavirus strains and obtained promising outcomes in accurately identifying them and differentiating them from other viruses. The authors claimed that, by utilizing the biofingerprint characteristics of the coronavirus, this technique could contribute to the creation of rapid diagnostic tools for effective disease control and surveillance.

Altan and Karasu [10] aimed to identify patients infected with coronavirus pneumonia from X-ray images, in order to distinguish pneumonia caused by COVID-19 from community-acquired pneumonia. The training set contained 2905 images: 1341 normal, 219 COVID-19 positive, and 1345 viral pneumonia. The test set contained only 581 images: 268 normal, 44 COVID-19 positive, and 269 viral pneumonia. The study used a hybrid model consisting of a 2D curvelet transform, CSSA, and EfficientNet-B0. The hybrid model achieved very high accuracy, specificity, precision, recall, and F-measure: 99.69%, 99.81%, 99.62%, 99.44%, and 99.53%, respectively.

Khanday et al. [11] classified textual clinical reports into four classes: COVID, SARS, ARDS, and both COVID and ARDS. The dataset was obtained from a GitHub repository and consists of 212 patients and 24 parameters. It is an unstructured dataset that represents the clinical information of the patients, such as survival, date, extubated, intubated, modality, temperature, pO2 saturation, needed supplemental O2, leukocyte count, lymphocyte count, neutrophil count, went icu, view, location, filename, folder, DOI, and other notes [11]. The data was analyzed using several machine learning algorithms: Support Vector Machine, Multinomial Naive Bayes, Logistic Regression, Decision Tree, Random Forest, Bagging, AdaBoost, and Stochastic Gradient Boosting. Around 70% of the dataset was used for training and 30% for testing. In addition, cross-validation was applied with all the algorithms. Logistic Regression and Multinomial Naive Bayes achieved better results than the other algorithms, with a precision of 94%, a recall of 96%, an F1 score of 95%, and an accuracy of 96.2%.

In Oman, previous reports have shown that obesity, chronic lung disease, and chronic kidney disease are risk factors for COVID-19-related intensive care unit admission and death in patients over 60 years old [12]. These underlying conditions have also been shown to be associated with severe COVID-19 in other countries [13, 14]. These studies and others strongly indicate a link between various underlying health conditions and susceptibility to negative COVID-19 outcomes. However, the emergence of highly transmissible new variants has further complicated the landscape of COVID-19. Genetic diversity (different variants) has been observed in the strains circulating in Oman because the virus arrived from multiple sources, as infected people came from different countries [15]. Countries around the world encountered different variants of COVID-19 [16, 17]; hence, they responded differently [18]. Analyzing the different variants will help in minimizing the risk of death, as some of these variants affect the antibodies and leave the immune system unable to tackle the virus [19]. Consequently, some variants escape the immune response elicited by prior infection or vaccination [19]. Understanding the parameters underlying disease prognosis is critical for patient care and disease management [20]. The WHO confirmed that many COVID-19 patients recovered without any medical intervention [21]. In contrast, there were cases in which medical treatment was required, especially for older patients and those who suffer from chronic diseases [22, 23]. Thus, there is a need to consider a dataset from Oman to predict the local parameters associated with death, in order to improve the diagnosis and treatment of COVID-19 in Oman. This calls for more sophisticated approaches such as Machine Learning (ML) to assist in predicting the blood parameters that boost immunity against COVID-19. This will help in developing self-therapies and strengthening the immune system against future attacks.

Medical data mining with ML tools has great potential for extracting hidden patterns from large datasets that can be utilized in clinical practice [24, 25]. During the pandemic, ML has been applied to various aspects of the disease, including treatment, diagnosis, prognosis, and epidemiology [26, 27]. This study follows a design science paradigm [28] for data-driven modeling to answer the following research question: how can big data and ML tools be leveraged to predict blood parameters of COVID-19 patients, as an effective mechanism for better healthcare facilities? The dataset was collected from the Ministry of Health in Oman, and 10-fold cross-validation was applied to ensure the reliability of the results. This study proposes guidelines for stakeholders to manage and mitigate the impact of COVID-19 or similar diseases on patients. It revealed that abnormalities in hemoglobin, mean cell volume, and eosinophil count are the main risk factors for poor prognosis in older patients. This finding will help decision makers in restructuring and prioritizing the treatment protocol.

2 Related work

The severity of COVID-19, a respiratory illness with widespread global impact, varies notably between individuals. Identifying clinical markers that accurately anticipate severity can greatly facilitate timely and focused intervention. Shang, Dong, et al. [29] analyzed clinical data from 443 individuals diagnosed with COVID-19, categorizing them into non-severe and severe groups. Their investigation identified the neutrophil-to-lymphocyte ratio (NLR) and C-reactive protein (CRP) as two notable predictors of illness severity. These findings align with previous meta-analytical research by Lagunas-Rangel, which emphasized that higher NLR values were indicative of more severe manifestations of COVID-19. CRP is a long-established inflammatory marker, and its higher levels in individuals infected by 2019-nCoV demonstrate its importance in the context of acute lung injury. This assertion was additionally supported by prior studies that identified hypoalbuminemia, lymphopenia, and CRP levels above 40 mg/L as potential indicators of the likelihood of respiratory failure in persons infected with MERS-CoV and suffering from pneumonia. Furthermore, the investigation examined the significance of platelets in relation to the severity of COVID-19, revealing that a higher platelet count may serve as a potential safeguard against the development of severe symptoms. This is consistent with the research by Georges, Brogly et al. [30] on thrombocytosis in patients with severe community-acquired pneumonia, published in the journal Chest, and with work that established a correlation between severe thrombocytopenia and heightened mortality in severe community-acquired pneumonia. However, in light of the divergent perspectives outlined in Elmaraghy’s research on the correlation between platelets and mortality among individuals with pneumonia, further examination is necessary to understand the precise role of platelets in forecasting the severity of COVID-19.

The study conducted by Masana et al. [31] examined a total of 1411 COVID-19 patients who were admitted to the hospital. The objective of the study was to investigate the potential association between the patients’ plasma lipid profile and their clinical outcomes. The study findings demonstrated that people who experienced severe indications of COVID-19 exhibited a significant decrease in high-density lipoprotein (HDL) cholesterol levels and an increase in triglyceride levels, both prior to and during their infection. Significantly, the aforementioned lipid abnormalities have been recognized as robust indicators of a serious progression of the disease. In addition to these findings, it was shown that the lipid profile had a close association with ferritin and D-dimer concentrations, while appearing to be unrelated to CRP levels. The study emphasized the significance of dyslipidemia, particularly atherogenic dyslipidemia, in relation to adverse COVID-19 results. The aforementioned observations about the lipid profile of individuals indicate the possible usefulness of this profile as an indicator of inflammation, hence justifying the need to assess its value in cases with COVID-19. Moreover, the techniques utilized for this research highlight the significance of taking into account intricate non-linear associations among predictors. This is exemplified by the implementation of Random Forests models in conjunction with regularized logistic regression. Cumulatively, these findings underscore the significance of the lipid profile as more than just a basic biomarker, but rather as an integral element within the broader context of COVID-19’s interaction with the physiological mechanisms of the host.

Statsenko et al. [32] evaluated a cohort of 560 patients who were diagnosed with COVID-19 at the Dubai Mediclinic Parkview Hospital between February and May 2020. The researchers had difficulty establishing cut-off values because the dataset was imbalanced, in particular between ICU hospitalizations and non-severe cases. Nevertheless, by applying a supervised machine learning algorithm, they adjusted the thresholds to improve the predictive accuracy of the model. Consequently, certain laboratory test thresholds were established and justified, including a lymphocyte count below 2.59 × 10^9/L and a C-reactive protein level of 14.3 mg/L, among other criteria. Notably, in the analysis of neural network performance, a model trained exclusively on a subset of high-value tests (aPTT, CRP, and fibrinogen) produced an area under the curve (AUC) of 0.86. This performance was comparable to that of a more comprehensive model trained on all available tests, which yielded an AUC of 0.90. These insights offer physicians distinct laboratory indicators to monitor, which may inform interventions and decisions regarding patient care throughout the ongoing epidemic. In addition, an openly available digital platform derived from the research outcomes offers a practical asset for healthcare practitioners worldwide.

Aktar et al. [33] conducted a comprehensive study to explore the possibilities of employing peripheral blood data for predicting clinical outcomes in patients with COVID-19. The aim of the study was based on the recognition that the rapid analysis of blood samples might provide valuable information not only for confirming diagnoses, but also for predicting the progression of the disease. Utilizing a diverse variety of machine learning algorithms, such as the random forest, gradient boosting machine, and deep learning techniques, the researchers conducted an in-depth analysis of clinical records in order to identify potential correlations and discern unique patterns. The findings of their study revealed certain hematological indicators that have the potential to act as distinguishing variables between individuals who are uninfected with COVID-19 and those who have contracted the virus. Significantly, the researchers discovered certain subsets of data, such as lactate levels, immature granulocytes, hemoglobin, and others, which exhibited substantial predictive capability in relation to the severity of symptoms associated with COVID-19. The approaches employed in their study demonstrated highly encouraging accuracy ratings, frequently over 90%. These results indicate that the routine analysis of blood samples has the potential to aid in the timely detection of individuals at a heightened risk level. These technological breakthroughs provide significant potential, particularly in underdeveloped nations where there is a lack of critical care services. Despite certain limitations, such as the relatively small sample size and the absence of comprehensive clinical data, Aktar et al. [33] have presented a fundamental study that highlights the indispensable contribution of machine learning and clinical blood data in augmenting the response to COVID-19. This study underscores the necessity for additional international investigations in this field.

The retrospective investigation by Asghar et al. [34] demonstrates the potential of hematological indicators in assessing severity and mortality risk among COVID-19 patients in Pakistan. The study involved 191 individuals who tested positive for COVID-19 by polymerase chain reaction (PCR). Its findings highlighted the significance of mean hemoglobin levels, leukocyte count, neutrophil-to-lymphocyte ratio (NLR), and various other characteristics. These hematological indicators differed notably between patients admitted to general wards and those in intensive care units (ICUs), with a particular focus on the significant variance in NLR and platelet-to-lymphocyte ratio (PLR) between the two cohorts. The differences between survivors and non-survivors were also noteworthy. Elevated NLR and PLR were observed in severe cases of COVID-19. Conversely, the lymphocyte-to-monocyte ratio (LMR) and lymphocyte-to-C-reactive protein ratio (LCR) were inversely correlated with the severity of the disease. The study emphasizes that while inflammatory and hematological indicators offer valuable insights into the initial phases of the disease, their effectiveness in predicting overall mortality or therapeutic outcomes throughout inpatient treatment may be limited. It highlights the need for further comprehensive follow-up studies before these indicators can be used effectively for patient care and prognosis, while underlining the importance of NLR, PLR, LMR, and LCR in relation to COVID-19 severity and mortality.

Seyit et al. [35] examined the predictive capabilities of specific hematological indicators, including C-reactive protein (CRP), white blood cell count (WBC), neutrophil-to-lymphocyte ratio (NLR), and platelet-to-lymphocyte ratio (PLR), among other variables. The study was conducted at Pamukkale University Hospital and included 233 patients who were admitted within a two-month period from March to April 2020. Individuals who were diagnosed with SARS-CoV-2 by polymerase chain reaction (PCR) testing exhibited noticeably increased levels of CRP, lactate dehydrogenase (LDH), PLR, and NLR. This observation implies a plausible association between these elevated biomarkers and the presence of COVID-19. In contrast, people who tested negative for the virus exhibited significantly higher eosinophil, lymphocyte, and platelet counts. These results suggest that such characteristics could serve as additional diagnostic tools that complement conventional real-time PCR. Nevertheless, the study by Seyit et al. is retrospective and emphasizes the need for prospective investigations. Such work is crucial to refine and validate these biomarkers, particularly in determining the most appropriate threshold values.

Yadav et al. [36] conducted research to highlight some of the essential tasks related to the spread of COVID-19. Examples of these tasks are estimating the COVID-19 growth rate, predicting how COVID-19 will end, estimating the transmission rate of the virus, and finding the correlation between weather conditions and COVID-19. The dataset covers several countries: Mainland China, the US, Italy, South Korea, and India. It records the total number of positive cases, recoveries, and deaths over 93 days. Temperature, wind speed, and humidity data were used to find the correlation between the spread of COVID-19 and weather conditions. The Support Vector Regression method was used, and its results were compared with other regression models such as Simple Linear Regression and Polynomial Regression. Although the reported accuracy rates are high, the methods used to measure accuracy were not described.

Khalifa et al. [37] employed machine learning algorithms to classify the type and level of coronavirus treatments on a single human cell. The dataset was obtained from the RxRx.ai repository. It consists of more than 1660 types of approved drugs in a human cell and more than 300,000 listed experiments. A DCNN model was used, comprising three ReLU layers, three pooling layers, and two fully connected layers. The proposed model was compared with other machine learning algorithms such as support vector machine, decision tree, and ensemble. In terms of treatment classification, the DCNN model recorded the highest accuracy, 98.05%, in comparison with the other algorithms. In terms of treatment level, classical machine learning (ensemble) recorded a score close to that of the proposed DCNN model (98.5% and 98.2%, respectively).

Wu et al. [38] employed a machine learning algorithm to help speed up the diagnosis of COVID-19 patients. The dataset contains 253 samples and 49 parameters, including age, gender, tuberculosis, lung cancer, and pneumonia. The random forest algorithm was used to predict the target class, and 10-fold cross-validation was used to validate the classification results. Classification quality was evaluated with several measures, such as the Matthews correlation coefficient (MCC), AUC, and total accuracy (ACC). The experimental results showed that the developed model is highly effective. It shortened the laboratory blood test process by providing a speedy diagnosis for infected patients. However, the major limitation of the study by Wu et al. [38] is that only one algorithm was employed in the experiments, without an appropriate justification for not considering other effective algorithms.

Based on the above, it can be noticed that most of the studies did not use blood parameters in their experiments. Altan and Karasu [10] is limited to pneumonia images, Yadav et al. [36] used a weather dataset, and Khalifa et al. [37] addressed the human cell. Wu et al. [38] used a few promising parameters; however, a single algorithm was employed in the experiments. The dataset used by Khanday et al. [11] is limited to unstructured clinical data. On top of this, due to the different variants of coronavirus, countries respond differently to COVID-19 patients [18]. Thus, there is a need to consider a dataset from Oman to predict the parameters associated with death, in order to improve the diagnosis and treatment of COVID-19 in Oman.

3 Data and methods

The clinical dataset, which includes CBC test results, was collected from the Royal Hospital in Oman. A total of 437 cases with a mean age of 33.54 ± 26.82 years (range: 13–90) were studied; 79.6% were female. All cases had a positive RT-PCR test for COVID-19 and were admitted to the hospital. Eighty-one percent (354 cases) recovered, while 19% (83 cases) died. The dataset includes various parameters such as age, gender, complete blood count, comorbidities, and blood type. The primary outcome for this study was death as a result of COVID-19. Initially, some parameters in the dataset had missing values. Removing the cases with missing values would have left a small, imbalanced dataset. Thus, the parameters with missing values were removed instead.
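The trade-off described above, dropping incomplete cases versus dropping incomplete parameters, can be illustrated with a minimal sketch. The records and the `ferritin` and `d_dimer` columns below are hypothetical and are not taken from the study's dataset; the sketch only shows why removing parameters preserves every case.

```python
# Hypothetical toy records (not the study's data); None marks a missing value.
records = [
    {"age": 64, "hemoglobin": 8.9, "ferritin": None, "d_dimer": 1.2, "outcome": "died"},
    {"age": 35, "hemoglobin": 10.2, "ferritin": 310.0, "d_dimer": 0.4, "outcome": "recovered"},
    {"age": 51, "hemoglobin": 9.4, "ferritin": 150.0, "d_dimer": None, "outcome": "recovered"},
]

def drop_incomplete_rows(rows):
    """Keep only the cases in which every parameter is present."""
    return [r for r in rows if all(v is not None for v in r.values())]

def drop_incomplete_columns(rows):
    """Keep every case, but only the parameters present in all cases."""
    complete = [k for k in rows[0] if all(r[k] is not None for r in rows)]
    return [{k: r[k] for k in complete} for r in rows]

by_rows = drop_incomplete_rows(records)
by_columns = drop_incomplete_columns(records)
print(len(by_rows), "case(s) kept when dropping rows")
print(len(by_columns), "case(s) kept when dropping columns:", list(by_columns[0]))
```

Dropping rows shrinks the sample and can distort the class balance, whereas dropping columns keeps all cases at the cost of losing those parameters.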

Table 1 summarizes the blood parameters of the patients with different disease outcomes. CBC analysis showed that most COVID-19 patients had changes in most blood parameters. Although most of the blood parameters were within the normal range, a few had means outside it. These included hemoglobin (8.71 g/dL ± 1.92), RBC counts (3.5 × 10^6 ± 0.79), mean cell hemoglobin (24.86 pg ± 2.36), red cell distribution width-CV (17.63% ± 2.19), neutrophils (6.64 K/μL ± 7.95), eosinophils (0.03 K/μL ± 0.08), platelet count (197.59 K/μL ± 112.33), and creatinine (243.35 mm/L ± 165.50). These results are consistent with previous studies that reported low thrombocyte and neutrophil counts in COVID-19 positive patients [39]. The cohort in this study is overwhelmingly composed of anemic patients. Previous studies have also found that COVID-19 patients, particularly female patients, had lower hemoglobin, which is further confirmed in this study [39]. Around 88.8% of the subjects in this study had hemoglobin lower than the normal range (11.5–15.5 g/dL), and most of these subjects are female (79.6%). Among the 348 females, 324 (93.1%) had hemoglobin below the normal range, while 63 of the 89 male patients (70.8%) had hemoglobin below the normal range. Moreover, the majority of the patients (86.7%) had renal malfunction, as indicated by the elevated serum creatinine level. A few other parameters were also abnormal in many patients, although their overall means were within the normal ranges. These parameters included platelet count, neutrophil count, and basophil count, each of which had a significant number of subjects with counts either below or above the normal range.

Table 1 Dataset of the study

This research adopts the Knowledge Discovery in Databases (KDD) methodology [40, 41] for carrying out artificial intelligence and machine learning projects. KDD is widely used in the literature by many researchers due to its effectiveness [42, 43]. It normally ensures that the obtained results are of high quality, scientifically valid, and potentially useful [43]. Based on the type of knowledge that can be discovered in a dataset, KDD techniques can be broadly classified into several categories, including clustering, classification, association, and estimation.

Following a typical KDD process roadmap, in which data mining is the core of the overall process, the experiments go through the following steps in the specified order, as shown in Fig. 1: problem specification, resourcing, data cleansing, pre-processing, data mining, evaluation of the results, interpretation of the results, and exploitation of the results.

To identify the most effective parameters of the dataset, two feature selection algorithms were employed: InfoGain and Correlation. The Information Gain algorithm measures how much information about the target class is gained by knowing the value of a feature. The result of this algorithm is presented in Fig. 2A. The Correlation algorithm, which is commonly used in the literature, calculates the correlation between each feature and the target class to determine the usefulness of the feature. The result of this algorithm is presented in Fig. 2B. Both algorithms ranked similar parameters as significant predictive factors in determining the final outcome and disease progression. Age was the most important predictive factor, followed by gender, hemoglobin, hematocrit, mean cell volume, and eosinophil count (Fig. 2A and B).
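As a rough illustration of the two ranking criteria, the following sketch computes information gain for discrete features and the Pearson correlation between a numeric feature and a 0/1-coded outcome. It is a minimal sketch under stated assumptions: the toy data, the feature names, and the helper functions are hypothetical, and the study's actual tooling may discretize numeric features or normalize scores differently.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Label entropy minus the entropy that remains after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [y for x, y in zip(feature, labels) if x == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

def pearson(xs, ys):
    """Pearson correlation between a numeric feature and a 0/1-coded target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical toy data (not the study's data): 1 = died, 0 = recovered.
died = [1, 1, 1, 0, 0, 0, 0, 0]
age_band = ["old", "old", "old", "young", "young", "young", "old", "young"]
gender = ["F", "M", "F", "M", "F", "M", "F", "M"]
age_years = [81, 77, 90, 30, 25, 41, 66, 19]

gains = {"age_band": information_gain(age_band, died),
         "gender": information_gain(gender, died)}
ranking = sorted(gains.items(), key=lambda kv: kv[1], reverse=True)
print("Information gain ranking:", ranking)
print("Correlation of age with outcome:", round(pearson(age_years, died), 3))
```

In this toy example the age band separates the outcomes better than gender, so it receives the higher information gain; a feature that leaves the class mix unchanged in every split would score zero.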

Fig. 1 KDD road map

Fig. 2 Royal-CBC: significance of the parameters

One of the most important issues in predictive analysis is measuring and evaluating classification quality, usually in terms of accuracy. Many measures are used to evaluate the effectiveness of a model, including Accuracy, Precision, Recall, F-Measure, MCC, ROC Area, and PRC Area.

Accuracy: a metric that is widely used in the context of classification. In practice, two measurements are commonly used for estimating classification accuracy or error. The accuracy \(r\) can be measured by \(r=\frac{1}{n}\sum_{i=1}^{C}{a}_{i}\) [44–50], where \(a_i\) is the number of majority cases with the same label in class \(i\), \(C\) is the number of classes, and \(n\) is the total number of cases in the dataset. Hence, the classification error can be obtained by \(e = 1 - r\). The smaller the value of \(e\), the better the results. The same logic is employed in [51], but the function is presented differently:

$$E_{c} = \frac{{\mathop \sum \nolimits_{i = 1}^{k} \left( {s_{i} - M_{i} } \right)}}{{\mathop \sum \nolimits_{i = 1}^{k} s_{i} }} = \frac{{\mathop \sum \nolimits_{i = 1}^{k} \left( {s_{i} - M_{i} } \right)}}{n}$$

where \(s_i\) is the size of class \(i\), \(M_i\) is the number of majority cases with the same label in class \(i\), and \(k\) is the number of classes.
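Assuming that \(a_i\) and \(M_i\) denote the same count and that \(C = k\) (an assumption, since the two formulas come from different sources), the two forms can be shown to agree. Because the class sizes sum to the dataset size, \(\sum_{i=1}^{k} s_i = n\), so:

$$E_{c} = \frac{\sum_{i=1}^{k} \left( s_{i} - M_{i} \right)}{n} = \frac{\sum_{i=1}^{k} s_{i}}{n} - \frac{1}{n}\sum_{i=1}^{k} M_{i} = 1 - \frac{1}{n}\sum_{i=1}^{k} M_{i} = 1 - r = e$$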

Precision: the proportion of the cases that the algorithm assigns to a class (for example, death) that truly belong to that class. It is defined as follows:

$$Precision = \frac{True\,Positive}{{True\,Positive + False\,Positive}}$$

Recall: a measure of the effectiveness of the algorithm, given by the proportion of actual positives that are correctly identified. Mathematically, it is calculated as:

$$Recall = \frac{True\,Positive}{{True\,Positive + False\,Negative}}$$

F-Measure: the harmonic mean of Precision and Recall. It is calculated as:

$$F\text{-}Measure = \frac{2 \times Precision \times Recall}{Precision + Recall}$$
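The three formulas above can be checked with a short sketch. The confusion-matrix counts are hypothetical and treat death as the positive class; they are not the study's results.

```python
def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 80 deaths predicted correctly, 5 recoveries wrongly
# predicted as deaths, 3 deaths missed.
tp, fp, fn = 80, 5, 3
p, r, f = precision_recall_f(tp, fp, fn)
arithmetic_mean = (p + r) / 2
print(f"precision={p:.3f} recall={r:.3f} F-measure={f:.3f} mean={arithmetic_mean:.3f}")
```

Because the F-measure is a harmonic mean, it never exceeds the arithmetic mean of precision and recall, and it drops sharply when either of the two is low.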

ROC Curve: ROC stands for Receiver Operating Characteristic. The curve illustrates how well a classifier separates the positive and negative classes and helps to identify the best threshold for separating them. It is generated by plotting the TP rate against the FP rate for different threshold values.
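The construction of the curve can be sketched as follows: each distinct score is used as a threshold, and the resulting (FP rate, TP rate) pair is one point on the curve. The scores and labels are hypothetical, and the trapezoidal area calculation is one common way to obtain the ROC area, not necessarily the one used in the study.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) pairs, one per threshold, from the highest threshold down."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities of death and true outcomes (1 = died).
scores = [0.95, 0.80, 0.70, 0.55, 0.40, 0.20]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
roc_area = area_under_curve(curve)
print(curve)
print(round(roc_area, 3))
```

A classifier that ranks every positive case above every negative case gives an area of 1, while random ranking gives an area of about 0.5.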

4 Results and discussions

The InfoGain and Correlation algorithms ranked the top parameters (Age, Blood Group, Gender, Mean Cell Hb, Mean Cell Volume, Hemoglobin, and Haematocrit of Blood) as significant predictive factors for the final outcome (target class). Several machine learning models were then developed based on the most predictive parameters.

To conduct the experiments, the following classification algorithms were used: J48, Naïve Bayes, IBK, Random Forest, Random Tree, and REPTree. In these experiments, 10-fold cross-validation was used to validate the obtained classification results. A comparison of the accuracy is illustrated in Fig. 3. According to the obtained results, the RandomForest algorithm ranked first with an accuracy of 99.1%. The second-best accuracy was achieved by J48, at 97.71%. The IBK and RandomTree algorithms ranked third, with an accuracy of around 97%. The lowest accuracy, 90.0%, was achieved by NaïveBayes.
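The 10-fold cross-validation procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: the data are synthetic, the classifier is a plain 1-nearest-neighbour rule (a simple instance-based learner in the spirit of IBK), and the folds are not stratified, so it is not a reproduction of the study's experiments.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def nearest_neighbour_predict(train_x, train_y, x):
    """Predict the label of the closest training case (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, x)) for row in train_x]
    return train_y[distances.index(min(distances))]

def cross_validated_accuracy(xs, ys, k=10):
    """Pooled accuracy over k folds; each case is held out for testing exactly once."""
    correct = 0
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train_x = [x for i, x in enumerate(xs) if i not in held]
        train_y = [y for i, y in enumerate(ys) if i not in held]
        correct += sum(nearest_neighbour_predict(train_x, train_y, xs[i]) == ys[i]
                       for i in held_out)
    return correct / len(xs)

# Synthetic two-class data (hypothetical): two well-separated Gaussian clusters.
rng = random.Random(1)
xs = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(50)]
      + [[rng.gauss(6, 1), rng.gauss(6, 1)] for _ in range(50)])
ys = [0] * 50 + [1] * 50

folds = k_fold_indices(len(xs), 10)
accuracy = cross_validated_accuracy(xs, ys)
print(f"10-fold cross-validated accuracy: {accuracy:.3f}")
```

Every case is used for testing exactly once and for training nine times, which is why the resulting estimate is less dependent on a single lucky or unlucky train/test split.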

Considering the other effectiveness measures, Fig. 4 shows that the RandomForest algorithm performed best, with a value of 99.1%. The J48 algorithm scored the second-highest precision and recall, at 97.7%. The IBK and RandomTree algorithms ranked third and fourth, respectively, and the NaïveBayes algorithm scored lowest. Overall, the results of all selected algorithms are valid, but the RandomForest algorithm is the most effective one. A precision of 97.7% indicates that when the model predicts death, the prediction is correct 97.7% of the time; a recall of 97.7% confirms that the model correctly identifies 97.7% of all death and recovered cases. Comparing the selected algorithms again, the F-measure confirmed that the RandomForest algorithm performs best, with a value of 99% (see Table 2). The ROC curve is one of the most important techniques for checking the effectiveness of a model and its algorithm. On this measure, too, RandomForest scored the highest result, which confirms it as the best-performing algorithm.

Fig. 3 Accuracy of the selected algorithms

Fig. 4 Precision and recall of the selected algorithms

Table 2 Detailed accuracy of the selected algorithms

As shown in Fig. 5A, patients younger than 59 years were more likely to improve and recover from COVID-19, while in patients older than 59 years, the strongest predictive factor of death was increased mean cell volume, particularly if they were older than 83 years. For patients between the ages of 59 and 87, a low eosinophil count was the strongest predictive factor of death. Another decision tree, shown in Fig. 5B, found another set of important predictive factors, which included age, mean cell hemoglobin, monocyte count, mean cell volume, and hematocrit. In this analysis, patients who were younger than 79 years but had high mean cell hemoglobin were more likely to die. On the other hand, patients with low mean cell hemoglobin were more likely to die if they also had lower mean cell volume. Older patients with a low monocyte count were also at risk of death.

A relationship was observed between specific blood groups and death, as illustrated in Fig. 6. Women with blood type O-positive had a good prognosis and improved, while almost 40% of male patients with blood type O-positive died. Although the cohort was largely women and blood type O-positive is common in Oman [52], these results still represent a clear disparity in disease outcome between male and female patients with the same blood type. As for the other blood groups, A-positive was the second most important predictor after O-positive. In both male and female patients, A-positive blood type was slightly associated with a bad prognosis when accompanied by abnormal mean cell volume. In this regard, males were also more affected.

Fig. 5 Experimental results 1: CBC parameters

Fig. 6 Experimental results 2: the blood types

Generally, the obtained results confirm previous studies that report an association of severe COVID-19 disease and death with abnormalities in blood parameters such as hemoglobin, hematocrit, WBC counts, and electrolytes, as well as with some comorbidities such as chronic kidney disease [53]. The study revealed that patients who were younger than 79 years and had high mean cell hemoglobin were more likely to die [54]. In contrast, when low mean cell hemoglobin is accompanied by lower mean cell volume, the patient has a high risk of death. Moreover, older patients with a low monocyte count were also at risk of severe disease. Infected patients younger than 59 years are more likely to recover from and survive COVID-19.

Table 3 shows detailed death information by blood group. A total of 356 (78.9%) patients had blood type O+, whereas 43 (9.5%) had blood type A+, 46 (10.2%) had blood type B+, and only 6 (1.3%) had blood type B−. Among the 356 patients who were admitted to the hospital with blood type O+, 314 (88.2%) improved and only 9% died. The highest proportions of deaths were observed among patients with types A+ and B+, at 51.2% and 63%, respectively.
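The percentages above can be checked directly from the reported counts. The short sketch below recomputes them, assuming that the four blood groups listed make up the whole cohort (so that the total is the sum of the four counts); the function name is illustrative only.

```python
# Patient counts per blood group, as quoted in the text.
counts = {"O+": 356, "A+": 43, "B+": 46, "B-": 6}


def blood_group_shares(group_counts: dict) -> dict:
    """Return each group's share of the cohort as a percentage (1 decimal)."""
    total = sum(group_counts.values())
    return {group: round(100 * n / total, 1) for group, n in group_counts.items()}


shares = blood_group_shares(counts)

# Share of O+ patients who improved (314 of 356, as quoted in the text).
improved_o_positive = round(100 * 314 / 356, 1)

if __name__ == "__main__":
    print("cohort size:", sum(counts.values()))
    for group, pct in shares.items():
        print(f"{group}: {pct}%")
    print(f"O+ improved: {improved_o_positive}%")
```

Under that assumption the shares come out as 78.9%, 9.5%, 10.2%, and 1.3%, and the improvement rate among O+ patients as 88.2%, which matches the figures quoted from Table 3.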

Table 3 Detailed information of the blood group parameter

Analysis of blood types revealed that 78.9% of patients had blood type O-positive, whereas 9.5% had blood type A-positive, 10.2% had blood type B-positive, and only 1.3% had blood type B-negative (see Table 3). The statistical analysis shows that among the 356 patients who were admitted to the hospital with blood type O-positive, 314 (88.2%) improved and only 9% died. The highest proportions of deaths were observed among patients with types A-positive and B-positive, at 51.2% and 63%, respectively. These results are consistent with multiple reports that showed an increased risk of severe COVID-19 disease among non-O types [55].

Most of the subjects in this study were anemic. Anemia has multiple causes and can be associated with many other conditions such as kidney diseases and hypothyroidism. Various studies have shown that low-level anemia is associated with worsening pneumonia in patients with COVID-19 [54]. Moreover, most of the subjects suffered from microcytic anemia, as evidenced by low mean cell volume. Microcytic anemia can be caused by iron deficiency, chronic inflammation, and thalassemia. On the other hand, monocytopenia is commonly associated with acute infections. In SARS-CoV-2 infection, monocytopenia is associated with mild and severe disease, particularly in patients with chronic conditions [56]. Eosinopenia is also common in many respiratory diseases, including COVID-19. Several studies demonstrated that eosinopenia is associated with severe COVID-19 disease and can therefore be reliably used as a prognostic marker [57, 58]. Although it is not clear how infection with SARS-CoV-2 causes eosinopenia, it is thought that eosinopenia can be caused by multiple triggers of acute inflammation, which is highly relevant in SARS-CoV-2 infection [57, 58].

Several studies have reported a relationship between blood type and COVID-19 disease outcome in which non-O patients are more prone to severe disease [59, 60]. Most of the subjects in this study were O-positive females. The experimental results show that the COVID-19 disease outcome in O-positive patients is largely dependent on gender. Male patients with blood type O-positive were more likely to die compared with their female counterparts. The same observation was made in patients with blood type A-positive. It is unclear, however, how specific blood types contribute to the severity of the disease in male patients. On the other hand, depending on the abnormality of the mean cell volume, blood type A-positive was slightly associated with a bad prognosis in both genders.

To sum up, this study revealed that age, hemoglobin, mean cell volume, and eosinophil count are the most significant factors in predicting the progression of the disease and the final outcome. Haematological manifestations such as low lymphocyte and eosinophil numbers have prognostic significance and are highly prevalent in COVID-19 patients. The reasons behind the drop in these parameters are summarized as follows:

  • Low eosinophil count (eosinopenia) is very common in many respiratory diseases, including COVID-19. Several studies demonstrated that eosinopenia is associated with severe COVID-19 disease and can therefore be reliably used as a prognostic marker. Although it is not clear how infection with SARS-CoV-2 causes eosinopenia, it is thought that eosinopenia can be caused by multiple triggers of acute inflammation, which is highly relevant in SARS-CoV-2 infection [57, 58].

  • Low hemoglobin level means anemia. Anemia has multiple causes and can be associated with many other conditions such as kidney diseases and hypothyroidism. Various studies have shown that low-level anemia is associated with worsening pneumonia in patients with COVID-19 [54].

  • Mean cell volume (MCV) is the measurement of the average size of the red blood cells. Most of the subjects in this study had an MCV of less than 78 fL, which is characteristic of microcytic anemia. Microcytic anemia can be caused by iron deficiency, chronic inflammation, and thalassemia.

  • Low monocyte count (monocytopenia) is commonly associated with acute infections. In SARS-CoV-2 infection, monocytopenia is associated with mild and severe disease, particularly in patients with chronic conditions [56].

5 Conclusion

This study employed the KDD methodology to experiment with the use of machine learning to predict the outcome of Coronavirus (COVID-19/SARS-CoV-2) patients. The results of the experiments give two outcomes: one confirms previously published reports and papers, and the other reveals some interesting patterns of disease progression. Abnormalities in various blood parameters are associated with death from COVID-19. Low thrombocyte count, neutrophil count, and hemoglobin are all reported by other researchers to be associated with a bad COVID-19 prognosis. The analysis in this study has confirmed that patients are more likely to die if they have abnormalities related to kidney function, such as Blood Sodium Level, Blood Chloride Level, and Serum Creatinine. The study revealed that patients who were younger than 79 years but had high mean cell hemoglobin were more likely to die [54]. On the other hand, patients with low mean cell hemoglobin were more likely to die if they also had lower mean cell volume (fL). Moreover, older patients with a low monocyte count were also at risk of death. Patients younger than 59 years are of less concern, as this group is more likely to recover from and survive COVID-19.

It is worth mentioning that there is a relationship between blood type and worsening of the disease. This association is largely dependent on gender. Male patients with blood type O+ or A+ are more likely to die compared with their female counterparts. Depending on the abnormality of the mean cell volume, the A+ blood type was slightly associated with a bad prognosis in both genders. As in most other studies, it has been confirmed that hemoglobin is one of the leading death factors [54]. The study also highlighted a questionable finding: a large proportion of patients (96.5%) died although their Blood Sodium, which was determined as the cause of death, was within the normal range (136–145 mmol/L).

Although the findings of this research are robust, they lack other important parameters about chronic diseases such as pneumonia, COPD, chronic renal disease, and diabetes. This is because different laboratory tests were requested for different patients depending on each patient's health condition. Providing such parameters and others would have assisted in identifying the relationship between blood parameters, chronic diseases, and disease progression with the patient status. Moreover, it would assist in reshaping the COVID-19 treatment protocol. Future work should consider obtaining comprehensive parameters and then conducting intensive experiments and validation with big data from Oman and other GCC countries. It should also consider employing artificial intelligence algorithms to optimize the overall classification accuracy.

6 Implication

The contribution of this study is classified into two types: theoretical and practical. Theoretically, several essential contributions are proposed. The research findings should help decision makers in health institutions to quantify the risks, health benefits, and cost-effectiveness of COVID-19. The obtained results can also help decision makers to take the necessary procedures to save patients' lives. The treatment protocol should be restructured to minimize the death risk.

Thus, whenever infected in-patients arrive at the hospital, first and foremost the mean cell hemoglobin in picograms (Mean Cell Hb, pg) needs to be checked, particularly if the in-patients are older than 59 years. According to the Ministry of Health in Oman, the normal range for the Mean Cell Hb is between 26 and 33 pg. Special attention therefore needs to be given to patients who are between 59 and 79 years old and whose Mean Cell Hb is below 26 pg. Such a value is considered abnormally low and indicates that less hemoglobin is present per red blood cell. Symptoms include, but are not limited to, shortness of breath, body tiredness, chest pain, and a fast heart rate. Ultimately, it might cause death if the Mean Cell Hb is not brought back to normal. If the Mean Cell Hb drops below 24.1 pg and the Mean Cell Volume in femtoliters (fL) falls below 78 fL, there is a high possibility that patients might lose their lives unless rapid medical intervention ensures that the Mean Cell Volume is within the normal range (78–96 fL). For in-patients who are older than 79 years, the initial medical checkup should focus on the Monocytes. This is because monocytes are a critical component of the innate immune system and actively contribute to the inflammatory and anti-inflammatory processes during an immune response. Consequently, if the monocyte value drops below 0.6, this might weaken the immune system in tackling foreign elements that invade the human body. There is also a possibility of death if the Monocytes value is higher than 0.6 but the Haematocrit of Blood is not within the normal range (35–45).

The blood group is one of the criteria that must be tested as well, to speed up the treatment protocol and minimize the death risk. It has been revealed that there is an association between death and specific blood groups; therefore, the necessary measures must be taken to reduce the risk. Women with blood group A or B and Mean Cell Hb less than 26 pg are at high risk of death compared to the other blood groups. Besides, men with blood group O, A, or B are at high risk of death. Accordingly, high-risk blood groups must be given priority in treatment. Following such a treatment protocol will contribute positively to reshaping the COVID-19 management protocol by improving the strategic decisions of the treatment system. Practically, the developed reusable model can easily be employed to predict the status of future patients: high risk (death) or low risk (recovered). This will assist in speeding up diagnosis and treatment.
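The checks described in this section can be written down as a small set of rules. The following is only an illustrative sketch of the thresholds quoted above, not the trained model and not a validated clinical tool. The `Patient` fields, the flag labels, and the function name are hypothetical; the age boundaries (59 to 79 inclusive, then older than 79) and the decision to apply the combined Mean Cell Hb and Mean Cell Volume rule only within the 59 to 79 age group are assumptions about how the text should be read. The unit of the monocyte value is not stated in the text and is left as-is.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Patient:
    """Hypothetical record holding only the parameters named in the text."""
    age: float
    sex: str          # "F" or "M"
    blood_group: str  # ABO letter only: "O", "A", "B", ...
    mch_pg: float     # Mean Cell Hb in picograms
    mcv_fl: float     # Mean Cell Volume in femtoliters
    monocytes: float  # unit not stated in the text
    hematocrit: float # Haematocrit of Blood, normal range 35-45


def triage_flags(p: Patient) -> List[str]:
    """Return the risk flags raised by the thresholds quoted in the text."""
    flags: List[str] = []

    if 59 <= p.age <= 79:  # assumed inclusive boundaries
        if p.mch_pg < 26:
            flags.append("low Mean Cell Hb (age 59-79)")
        if p.mch_pg < 24.1 and p.mcv_fl < 78:
            flags.append("low Mean Cell Hb with low Mean Cell Volume")
    elif p.age > 79:
        if p.monocytes < 0.6:
            flags.append("low Monocytes (age > 79)")
        elif not (35 <= p.hematocrit <= 45):
            flags.append("abnormal Haematocrit (age > 79)")

    # Blood-group rules as stated in the text.
    if p.sex == "F" and p.blood_group in {"A", "B"} and p.mch_pg < 26:
        flags.append("high-risk blood group (female)")
    if p.sex == "M" and p.blood_group in {"O", "A", "B"}:
        flags.append("high-risk blood group (male)")

    return flags


if __name__ == "__main__":
    # Hypothetical patient, for illustration only.
    print(triage_flags(Patient(65, "F", "O", 23.5, 75.0, 0.8, 40.0)))
```

An empty list means that none of the quoted thresholds were crossed; it does not by itself indicate a low-risk patient, since the rules cover only the parameters discussed in this section.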