Introduction

In December 2019, the first case of COVID-19 was reported in Wuhan, Hubei Province, China, in a person suffering from a severe flu-like illness. The pathogen behind the disease was identified in January 2020 as a novel coronavirus, subsequently named SARS-CoV-2, which stands for Severe Acute Respiratory Syndrome Coronavirus 2. In February 2020, the World Health Organization (WHO) coined the term "COVID-19", where 'CO' stands for 'corona', 'VI' for 'virus', 'D' for 'disease', and '19' for 2019, the year in which the disease emerged. The COVID-19 pandemic has emerged as a major threat to public health worldwide. It has had a drastic impact on the economic stability and social life of countries across the globe, and it has also exposed how their societies and governments function while taking measures to curb the spread of the disease.

Many researchers working in machine learning and artificial intelligence are using their expertise to analyze the epidemic by constructing mathematical models from the nationwide data available on the disease [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15]. The transmission and propagation of diseases have been an important application of mathematical modeling for many years. Artificial intelligence, on the other hand, is an important technique used for prediction [16, 17], with applications in machine learning, computer vision, fraud detection, robotics, and other fields. Machine learning algorithms have many applications in mathematics, physics, and biology, so understanding their basic concepts is of great importance to researchers today [2]. Machine learning models have previously been used to predict the number of upcoming COVID-19 patients with four techniques: linear regression (LR), support vector machine (SVM), least absolute shrinkage and selection operator (LASSO), and exponential smoothing (ES) [18]. Various machine learning and ensemble learning models, such as decision tree, random forest, SVM, XGBoost, linear regression, kernel ridge, multi-layer perceptron, and a few others, were used to investigate the real effect of temperature and humidity on the spread of COVID-19 [19]. In a different application area, machine learning classifiers were reported to give higher accuracy for precision and smart farming with good crop management, which can help the government make decisions and formulate policies for farmers and consumers [20]. Recently, Kumari et al. [21] carried out a detailed study of various forecasting models for predicting the number of confirmed cases, recovered cases, and deaths in India due to COVID-19.
The authors used the correlation coefficient and multiple linear regression to make their predictions, and then applied autocorrelation and autoregression techniques to improve accuracy. Gupta et al. [22] predicted the confirmed, death, and cured cases of COVID-19 using multiclass classification. They performed data cleansing and feature selection, followed by forecasting with machine learning algorithms such as random forest, linear model, support vector machine, decision tree, and neural network. On the basis of their results, they concluded that random forest was the best technique for prediction and analysis, as it outperformed all the others in terms of accuracy on the given data set. Shaikh et al. [23] carried out an in-depth analysis of the COVID-19 outbreak in India to determine the optimal regression model for predicting confirmed, recovered, and death cases in India over a particular time period. A linear and a polynomial regression model were implemented, and the R-squared score and error values were used to evaluate them. Hassan et al. [24] described four standard models, namely neural network, SVM, Bayesian network, and polynomial regression, to track and predict the number of people who have been infected by, have recovered from, or have died of COVID-19. To evaluate the performance of these algorithms, root-mean-squared error, mean squared error, mean absolute error, explained variance score, and R-squared score were employed. The authors of [25] employed a fine-tuned ensemble classification technique to forecast the death rate and recovery rate of COVID-19 patients across various states of India. After applying the classification model, they evaluated the performance of different state-of-the-art classifiers against it and reported the results accordingly.
The results show that the proposed model outperformed others, such as Decision Tree, Gaussian Naïve Bayes, and Support Vector Machines, in giving a more accurate prediction of COVID-19 cases. Another study carried out an artificial intelligence-based meta-analysis to forecast the global effects of COVID-19 [26]. Its results indicate that the random forest algorithm gave a good prediction of the number of future cases that may arise in a sudden outbreak like COVID-19. Various other research works have also employed different machine learning algorithms to predict specific outcomes from the trends and available data on COVID-19 [27].

In this paper, we use several supervised machine learning algorithms to study, estimate, and finally predict the dependence of the COVID-19 death rate on various other factors. The article describes a tool for predicting COVID-19 mortality risks with machine learning algorithms while also giving insight into the distinct risks that different socio-demographic and socio-economic groups face from COVID-19. We analyze the boosted RF model, the XGBoost model, and the SVM model for forecasting the COVID-19 mortality rate, as a benchmark for predictive performance on a wide spectrum of nationwide data. Unlike the models covered in the literature survey, the model described in our study uses a wide range of variables (16 in total) to predict the target value. So many variables have rarely been used in the literature, and hence our work makes a significant contribution to the prediction of COVID-19 deaths from a wide variety of variables. We intended to incorporate a few more variables, but because of many missing values in the original data set, some data features could not be used in our analysis. Another significant contribution is that the results can help the government formulate policies and ground rules, and arrange adequate facilities and infrastructure, for containing the spread of COVID-19 and thereby reducing the number of deaths due to the disease.

The rest of this paper is structured as follows. Section “Methodologies” describes the component models individually. Section “Experimental Evaluations” discusses the data set used in the article, the results of the forecasting models, and their forecasting performance. Section “Conclusion” closes the study with a discussion and the conclusion.

Fig. 1 Workflow diagram for Random Forest

Fig. 2 Workflow diagram for SVM

Fig. 3 Workflow diagram for XGBoost

Methodologies

This section briefly describes the models implemented here for data analysis.

Linear regression is usually the most basic regression model that one uses to start understanding machine learning, but a major restriction is that it is applicable only when the relationship is linear, and hence it cannot be applied everywhere. It also assumes that the input features are mutually independent, and therefore it might not give the most accurate prediction. Lasso regression is a variant of linear regression that uses shrinkage. However, lasso regression struggles with some types of data. In particular, if two or more variables are highly collinear, lasso regression will select one of them essentially at random, which is not good for accurate interpretation of the data. In contrast, decision tree and random forest algorithms support nonlinearity, which is the primary reason for using those techniques in our study over linear regression. Another technique that we have used is the support vector machine, as its kernel function can support both linear and nonlinear solutions. The support vector machine also handles outliers better than linear regression, which is an added advantage. We considered XGBoost for our study because of its execution speed and its better overall performance compared with linear or lasso regression. It is generally observed that the XGBoost algorithm fits training data much more accurately than these two.

Random Forest

Random Forest is a complete, independent learning technique used for tasks such as classification and regression. It is considered quite effective because it runs quickly on large data sets. In recent times, it has been in great demand because of its application in a diverse range of research projects, including real-world practical applications [20]. A random forest merges many randomized decision trees and averages their predictions. The technique has shown exceptional results in problems where the number of variables considerably exceeds the number of observations. Being versatile and adaptable, RF can be applied to a wide variety of problems. Some of the advantages of this machine learning technique are its non-parametric nature, its interpretability, and its high prediction accuracy on most data sets. Before running the RF algorithm, the user has to set certain hyperparameters: how many observations are drawn for each decision tree, whether those observations are drawn with or without replacement, which splitting rule is followed, the minimum number of observations in a node, and how many decision trees to use to get accurate predictions while avoiding unnecessarily lengthy computations. The process of finding the optimal hyperparameters of a machine learning algorithm for a given data set is known as tuning. Random forest models are created using a technique known as bagging, in which the decision trees are employed as parallel estimators. The number of trees in a random forest is therefore a control parameter of the algorithm. One advantage is that increasing the number of trees in a random forest does not result in overfitting.
After a certain point, however, adding more trees no longer improves the accuracy of the predictions, although it does not harm the model either; it only increases the computational time. The workflow diagram for Random Forest is given in Fig. 1.
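The bagging-and-averaging idea described above can be illustrated with a minimal Python sketch. It is not the implementation used in this study: it bags depth-one regression trees (stumps) on a single feature and omits the random feature selection of a full random forest. The function names, the toy data, and the choice of 50 trees are assumptions made purely for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: choose the threshold with the lowest squared error."""
    best = None
    # Candidate thresholds exclude the largest x so the right side is never empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # All x values in this bootstrap sample are equal: predict the mean.
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Train n_trees stumps on bootstrap samples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)


# Hypothetical toy data: low responses for x <= 4, high responses for x >= 5.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]
forest = fit_bagged_stumps(xs, ys)
```

Calling `forest(2)` and `forest(7)` returns the averaged predictions for a low and a high input. Raising `n_trees` mainly increases the running time once the average has stabilized, which is the behaviour described in the text.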

Support vector machine (SVM)

Another widely used supervised machine learning technique is SVM. It is primarily used for classification; however, it can also be used for regression. An SVM model solves an optimization problem to determine the decision boundary [28]. In doing so, it pursues the following objectives:

  • To maximize the distance from the decision boundary to the support vectors.

  • To maximize the number of points that are classified correctly in the training data set.

There is clearly a trade-off between these two goals, because one might have to keep the decision boundary very close to a specific class or support vector to maximize the number of correctly labeled data points. This is the first approach used in SVM. Its disadvantage is that the resulting model is very sensitive to small changes and noise in the independent variables, a phenomenon referred to as overfitting. The second approach is to keep the decision boundary as far as possible from the support vectors at the expense of some misclassified data points. The workflow diagram for SVM is given in Fig. 2.
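The tension between the two goals can be made concrete with the standard soft-margin formulation, which this article does not state explicitly and which is assumed here: the objective is the sum of a margin term and a misclassification penalty weighted by a parameter C. The sketch below evaluates that objective for two hand-picked linear boundaries on hypothetical one-dimensional data containing one outlier; all names and values are illustrative.

```python
import math


def soft_margin_objective(w, b, X, y, C):
    """Return 0.5*||w||^2 + C * (sum of hinge losses); labels must be +1 or -1."""
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    )
    return 0.5 * sum(wj * wj for wj in w) + C * hinge


def margin_width(w):
    """Width of the margin, 2 / ||w||, for a linear boundary."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical data: two negatives, two positives, and a positive outlier at -0.5.
X = [(-2.0,), (-1.0,), (1.0,), (2.0,), (-0.5,)]
y = [-1, -1, 1, 1, 1]

wide = ([1.0], 0.0)    # wide margin, misclassifies the outlier
narrow = ([4.0], 3.0)  # narrow margin, classifies every point correctly

for C in (1.0, 100.0):
    print(C,
          soft_margin_objective(*wide, X, y, C),
          soft_margin_objective(*narrow, X, y, C))
```

With a small C the wide-margin boundary has the lower objective (the second approach in the text), whereas a large C favours the narrow boundary that hugs the outlier (the first approach, which is prone to overfitting).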

XGBoost

Extreme gradient boosting (XGBoost) is a scalable supervised machine learning technique based on gradient-boosted decision trees [29]. It is one of the dominant machine learning libraries for regression, classification, and ranking problems. The main machine learning concepts required to build the XGBoost algorithm are decision trees, ensemble learning, supervised machine learning, and the general gradient boosting technique. The fundamental idea behind improving machine learning models in this way is to integrate thousands of low-accuracy prediction models into one high-accuracy model. For a large data set, the model might need a very large number of such integration steps to reach the desired accuracy, and under such circumstances XGBoost models are found to be extremely useful.

We define the objective function of an XGBoost model as follows:

$$\begin{aligned} O_{k} = \sum _{i=1}^{n} l\left( y_i,\hat{y}_i^{(k-1)}+f_k\left( x_i\right) \right) +\Omega \left( f_k\right) , \end{aligned}$$
(1)

where n denotes the sample size, k is the iteration number, \(\hat{y}_i^{(k-1)}\) is the prediction for the i-th sample after \(k-1\) iterations, and \(f_k\) is the tree added at the k-th iteration, which is fitted to the error left by the previous iterations. l denotes the cost function and \(\Omega\) the regularization term that penalizes the complexity of the model. The cost function in an XGBoost model measures the difference between the label and the prediction from the previous step combined with the output of the new tree. The algorithm uses a second-order Taylor expansion of the loss function. The main purpose of the second-order Taylor expansion is to approximate the loss function and optimize the objective function so that the prediction is as close as possible to the actual value, hence improving the prediction accuracy. The workflow diagram for XGBoost is given in Fig. 3.
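As a sketch of the second-order expansion mentioned above, following the standard gradient-boosting derivation rather than anything stated explicitly in this article, write \(\hat{y}_i^{(k-1)}\) for the prediction of the i-th sample after \(k-1\) iterations and \(O_k\) for the objective of Eq. (1). The symbols \(g_i\) and \(h_i\) are introduced here for illustration and denote the first and second derivatives of the cost function. Expanding l around \(\hat{y}_i^{(k-1)}\) gives approximately

$$\begin{aligned} O_k \approx \sum _{i=1}^{n}\left[ l\left( y_i,\hat{y}_i^{(k-1)}\right) + g_i f_k\left( x_i\right) + \frac{1}{2} h_i f_k\left( x_i\right) ^2\right] + \Omega \left( f_k\right) , \qquad g_i = \frac{\partial l\left( y_i,\hat{y}_i^{(k-1)}\right) }{\partial \hat{y}_i^{(k-1)}}, \quad h_i = \frac{\partial ^2 l\left( y_i,\hat{y}_i^{(k-1)}\right) }{\partial \left( \hat{y}_i^{(k-1)}\right) ^2}. \end{aligned}$$

For example, with the squared-error cost \(l(y,\hat{y})=(y-\hat{y})^2\), one obtains \(g_i = 2(\hat{y}_i^{(k-1)}-y_i)\) and \(h_i = 2\), so the expansion is exact in that case.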

Decision Tree

A decision tree is a widely used technique for categorizing systems with multiple covariates [30]. Decision trees are also used to develop algorithms that predict a target variable. In this method, a given population is classified into subclasses or segments that resemble the branches of an inverted tree with a root node and leaf nodes. Being a non-parametric algorithm, it has a simple and user-friendly structure for dealing with large and complicated data sets. The decision tree algorithm belongs to the class of supervised learning algorithms and can be used for solving classification as well as regression problems. The main steps and parameters involved in constructing a decision tree are described below.

A decision tree divides the complete data set into smaller subsets based on the most important attribute. The root node represents the most important covariate in the data set, and it is then split into subnodes, commonly known as decision nodes. While constructing a decision tree, we divide the data set into regions that are disjoint and homogeneous. The result looks like an inverted tree: the uppermost part represents all the observations at a single node, which divides into two or more subnodes, and each of those subnodes splits further into more subnodes. This procedure is said to follow a greedy approach, as it concentrates only on the current node without considering future nodes. The two main parameters used in a decision tree algorithm are the Max depth parameter and the minimum impurity decrease parameter. The Max depth parameter controls the depth of a decision tree: the nodes keep splitting until the tree reaches the value of the Max depth parameter, at which point the algorithm stops automatically. There is no universally optimal value for the Max depth parameter, as it is specific to the data set being considered. The probability of overfitting increases as the decision tree gets deeper. Conversely, if the Max depth parameter is set too small, the model might fail to capture enough information about the data set. This phenomenon is referred to as underfitting, and it should also be avoided. At each split, the primary objective of the algorithm is to decrease the impurity. The informative power of a split is directly proportional to the decrease in impurity. Also, as the tree gets deeper, the decrease tends to become smaller. Hence, we can use this parameter to stop the tree from splitting further by assigning a threshold value.
After the decision tree has been built, we try to improve it by pruning, that is, by removing the parts of the tree that mainly reflect unwanted data such as outliers or noise.
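The two stopping rules described above can be sketched in a few lines of Python. The sketch assumes a regression tree whose impurity is the variance of the response in a node, which is one common choice and not necessarily the criterion used by the software in this study; the function and parameter names are hypothetical.

```python
from statistics import pvariance


def impurity_decrease(parent, left, right):
    """Weighted decrease in variance (the assumed impurity) produced by a split."""
    n = len(parent)
    return (pvariance(parent)
            - (len(left) / n) * pvariance(left)
            - (len(right) / n) * pvariance(right))


def should_split(parent, left, right, depth, max_depth, min_impurity_decrease):
    """Split only if the depth limit is not reached and the split is informative enough."""
    if depth >= max_depth:
        return False
    return impurity_decrease(parent, left, right) >= min_impurity_decrease


# Hypothetical responses in one node and two candidate splits of that node.
parent = [1.0, 1.0, 5.0, 5.0]
good = ([1.0, 1.0], [5.0, 5.0])  # separates low from high responses
poor = ([1.0, 5.0], [1.0, 5.0])  # leaves both children as mixed as the parent
```

Here the first candidate split removes all of the variance in the node, while the second removes none, so a positive threshold accepts the former and rejects the latter.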

Experimental Evaluations

In this section, our models are fitted to the COVID-19 death cases. We also check the accuracy of the models with three metrics, namely MAE, RMSE, and R squared.

Data

The dataset used in the study was downloaded from https://ourworldindata.org/covid-deaths#explore-the-global-data-on-confirmed-covid-19-deaths. It contains 165870 observations with 67 attributes for all countries, covering the period from the beginning of the epidemic up to 03.03.2022. Figure 4 depicts the cumulative COVID-19 death cases for all countries. Because many values are missing for some data features, a selection of 16 variables was used for our analysis. We take "New deaths per million" as the target variable. After deleting the "NA" values, we get 10052 data points with 15 possible causal variables (Reproduction rate, Icu patients, People vaccinated per hundred, People fully vaccinated per hundred, Stringency index, Population density, Aged 60 older, Aged 70 older, GDP per capita, Cardiovascular death rate, Diabetes prevalence, Female smokers, Male smokers, Hospital beds per hundred, and Human development index) and one response variable, "New deaths per million", which are used to find the optimal decision tree. Figure 5 illustrates the correlation matrix between the target value and the other features. It shows that the target value (New deaths per million) is highly correlated with Icu patients, Stringency index, Cardiovascular death rate, people vaccinated with a single dose, and people vaccinated with a double dose.

Next, similar criteria were followed to extract the data for the US, India, Italy, and the three continents Asia, Europe, and North America from https://ourworldindata.org/covid-deaths. For the next part of the analysis, we take a country- or continent-specific subset of the above-mentioned data set with two attributes: “date” and the response variable “New deaths per million”. We selected the three continents Asia, Europe, and North America to get the maximum number of data points after deleting the NA values from the dataset. We chose the US, India, and Italy because these countries contribute the leading numbers of COVID-19 cases in their respective continents. The last 26 days are used for model testing, and the remaining data samples are used as the training dataset for SVM, Random Forest, and XGBoost. The combined workflow diagram for the proposed models is presented in Fig. 6.
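The hold-out scheme described above, in which the most recent 26 days form the test set, can be sketched as follows. This is only an illustration under assumed names: the function `chronological_split` and the synthetic daily series are hypothetical and do not reproduce the actual data handling of this study.

```python
from datetime import date, timedelta


def chronological_split(rows, n_test=26):
    """Split (date, value) rows into training rows and the last n_test rows by date."""
    if not 0 < n_test < len(rows):
        raise ValueError("n_test must be between 1 and len(rows) - 1")
    # Sort by date so that the test set always holds the most recent observations.
    ordered = sorted(rows, key=lambda row: row[0])
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical series of 100 daily values of "New deaths per million".
start = date(2021, 1, 1)
rows = [(start + timedelta(days=d), 0.1 * d) for d in range(100)]
train, test = chronological_split(rows)
```

Splitting by date rather than at random keeps every test observation later than every training observation, which matches a forecasting setting.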

Performance Assessment

To evaluate the performance of our models, we have used the following metrics: MAE, RMSE, MSE, and R squared.

$$\begin{aligned} \hbox {MAE}= & {} \dfrac{1}{n}\sum \limits _{i=1}^{n}|e_i| \\ \hbox {RMSE}= & {} \sqrt{\dfrac{1}{n}\sum \limits _{i=1}^{n}e_i^2} \\ \hbox {MSE}= & {} \dfrac{1}{n}\sum \limits _{i=1}^{n}e_i^2, \end{aligned}$$

where \(e_i=z_{i}-y_{i}\), \(z_{i}\) is the value predicted by our model, \(y_{i}\) is the target output, and i indexes the data points from 1 to n.
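The three error measures defined above translate directly into code. The following minimal sketch uses a hypothetical function name and made-up numbers, and simply evaluates the formulas with \(e_i = z_i - y_i\).

```python
import math


def regression_errors(predicted, actual):
    """Return MAE, MSE and RMSE for paired predictions z_i and targets y_i."""
    errors = [z - y for z, y in zip(predicted, actual)]  # e_i = z_i - y_i
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Hypothetical predictions and targets; the errors are 1, 0 and 2.
metrics = regression_errors([2.0, 4.0, 7.0], [1.0, 4.0, 5.0])
```

Because RMSE is the square root of MSE, the two always rank models identically; RMSE has the advantage of being expressed in the units of the response variable.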

R Squared: R squared is the ratio of the sum of squares regression (SSR) to the sum of squares total (SST). SSR represents the total variation of the predicted values on the regression line from the mean of the response variable, while SST denotes the total variation of the actual values from that mean. The R-squared value measures how well the fitted line explains the data. The formula for calculating it is given below

$$\begin{aligned} R^2 = \frac{\textrm{SSR}}{\textrm{SST}} = \frac{\sum \left( \hat{y}_i - \bar{y}\right) ^2}{\sum (y_i - \bar{y})^2}. \end{aligned}$$
(2)

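Here \(\hat{y}_i\) is the predicted value, \(y_i\) the actual value, and \(\bar{y}\) the mean of the actual values. \(R^2\) is also referred to as the coefficient of determination, and for a least-squares regression fit its value ranges between 0 and 1. When \(R^2\) is instead computed as \(1-\textrm{SSE}/\textrm{SST}\), where SSE is the sum of squared prediction errors, a model that fits worse than simply predicting the mean gives a negative value of \(R^2\). The closer the value of \(R^2\) is to 1, the better the accuracy of the model.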
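The ratio in Eq. (2) is a quotient of sums of squares and so cannot itself be negative; negative values arise only under the alternative form \(1-\textrm{SSE}/\textrm{SST}\), with SSE the sum of squared prediction errors. The sketch below, with hypothetical function names and toy numbers, computes both forms so that the difference can be seen on a deliberately poor prediction.

```python
def r2_ssr_over_sst(predicted, actual):
    """R^2 as SSR / SST, following Eq. (2)."""
    mean_y = sum(actual) / len(actual)
    ssr = sum((p - mean_y) ** 2 for p in predicted)
    sst = sum((y - mean_y) ** 2 for y in actual)
    return ssr / sst


def r2_one_minus_sse_over_sst(predicted, actual):
    """R^2 as 1 - SSE / SST, which can be negative for poorly fitting models."""
    mean_y = sum(actual) / len(actual)
    sse = sum((y - p) ** 2 for p, y in zip(predicted, actual))
    sst = sum((y - mean_y) ** 2 for y in actual)
    return 1.0 - sse / sst


actual = [1.0, 2.0, 3.0]
reversed_prediction = [3.0, 2.0, 1.0]  # deliberately poor: the trend is inverted

print(r2_ssr_over_sst(reversed_prediction, actual))            # 1.0
print(r2_one_minus_sse_over_sst(reversed_prediction, actual))  # -3.0
```

The two forms agree for a least-squares linear fit with an intercept, but for other models, such as those evaluated on held-out test data, the second form is the one that flags a poor fit.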

Results

In this section, machine learning techniques are applied to the COVID-19 dataset to extract information and understand the effect of various variables on the COVID-19 death rate. First, we used an optimal Decision Tree model for the dataset containing 10052 data points with 15 possible causal variables (Reproduction rate, Icu patients, People vaccinated per hundred, People fully vaccinated per hundred, Stringency index, Population density, Aged 60 older, Aged 70 older, GDP per capita, Cardiovascular death rate, Diabetes prevalence, Female smokers, Male smokers, Hospital beds per hundred, and Human development index) and one response variable “New deaths per million”. The Decision Tree is implemented in R using the “rpart” package to find the high-risk variables, among the 15 possible causal variables, that are closely related to the COVID-19 death cases. An optimal decision tree is formed with equal costs for the ten variables. The performance assessment metrics for the fitted tree are as follows: RMSE = 1.91, \(R^2\) = 0.74, and MAE = 1.24. Figures 7 and 8 show the fitted tree and the variable importance list, respectively. The variable importance plot identifies the ten causal variables that have higher importance than the others. The variables which have a higher impact on COVID-19 death cases are as follows: Icu patients, GDP per capita, People fully vaccinated per hundred, Cardiovascular death rate, People vaccinated per hundred, Female smokers, Stringency index, Human development index, Diabetes prevalence, and Male smokers. As expected, ICU patients have a higher chance of morbidity, but our study finds that GDP per capita has a higher impact than the Stringency index on COVID-19 death cases. It can be observed that countries with a higher GDP per capita have lower values of the response variable “New deaths per million”. Vaccination plays a vital role in decreasing the mortality rate.
People who have received two vaccination doses have a lower mortality rate than those who have received one dose. Interestingly, our research also shows that countries with a higher Cardiovascular death rate have a higher mortality rate. Smoking is also responsible for higher COVID-19 mortality rates.

In the next part, machine learning models were built for the extracted data of the US, India, Italy, and the three continents Asia, Europe, and North America. We used the last 26 data points as testing data and the remaining data points as training data. For each country and continent, the Random Forest, XGBoost, and SVM models were implemented to estimate the death cases in that region. The goal of this task is to predict the death cases and compare the regression models. Each model was trained with the features Reproduction rate, Icu patients, People vaccinated per hundred, People fully vaccinated per hundred, Stringency index, Population density, Aged 60 older, Aged 70 older, GDP per capita, Cardiovascular death rate, Diabetes prevalence, Female smokers, Male smokers, Hospital beds per hundred, and Human development index to predict the death cases per million people in each country/continent. The performance of the proposed models is evaluated using the RMSE, MAE, and R-squared metrics. The best-fitting model is SVM for Europe, North America, and India, XGBoost for Asia and the United States, and Random Forest for Italy (Table 1). Based on the experimental results, we can claim that the variables Icu patients, GDP per capita, People fully vaccinated per hundred, Cardiovascular death rate, People vaccinated per hundred, Female smokers, Stringency index, Human development index, Diabetes prevalence, and Male smokers are closely associated with the COVID-19 mortality rates. Table 1 presents the experimental results for the regression models using the tenfold cross-validation (CV) procedure to predict the number of COVID-19 death cases, where the performance of these models is evaluated using various performance evaluation metrics. Figures 9 and 10 illustrate the forecasting ability of our models for the countries India, the US, and Italy and for the three continents Asia, Europe, and North America, respectively.
We observed that SVM for Europe, North America, and India, XGBoost for Asia and the US, and Random Forest for Italy predict the results most accurately. However, no single method predicts the results accurately in every case.

Fig. 4 Worldwide Cumulative death cases due to COVID-19

Fig. 5 Correlation matrix between the target value and other features

Fig. 6 Combined workflow model diagram

Fig. 7 Optimal tree representing the relationships between the variables and death rate for COVID-19

Fig. 8 Important variables’ percentage affecting the death rate for COVID-19

Fig. 9 Actual vs predicted forecasts for Countries: India, US, and Italy (Testing data)

Fig. 10 Actual vs predicted forecasts for Continents: Asia, Europe, and North America (Testing data)

Table 1 Quantitative measures of performance for different models on the testing data for COVID-19 death cases

Conclusion

In this study, we have proposed a few predictive models for estimating COVID-19 death cases from each country’s input variables, which are a combination of Reproduction rate, Icu patients, People vaccinated per hundred, People fully vaccinated per hundred, Stringency index, Population density, Aged 60 older, Aged 70 older, GDP per capita, Cardiovascular death rate, Diabetes prevalence, Female smokers, Male smokers, Hospital beds per hundred, and Human development index. This study contributes awareness and understanding of the factors that can decrease or increase COVID-19 death cases, which will further help the Government to formulate policies and ensure a proper planning strategy to overcome the disaster.

A previous study [30] suggests that the number of cases, the population in the age group above 65 years, the lockdown period, and the number of hospital beds per 1000 people are responsible for a higher case fatality rate, but our study can give researchers new insight into the possible causes of a higher COVID-19 mortality rate. Our study suggests that:

  • All countries should complete the double dose of vaccination for all of their citizens as early as possible.

  • Special care needs to be taken for cardiovascular patients.

  • Smoking should be prohibited in all public places.

  • Economic growth or GDP is closely associated with the COVID-19 mortality rate.