Correlating Population Growth with the Number of DM Patients and the Increase in the Gross Regional Product with the Growth in the Number of DM Patients in Kazakhstan
It is often necessary to explore the relationship between continuous variables. This can be done using correlation analysis, as illustrated by Table 2, which presents the correlation between population growth and the number of DM patients in each region of Kazakhstan. The result of correlation analysis is a correlation coefficient (r), the value of which can range from −1 to +1. A correlation coefficient of +1 indicates a perfect positive linear relationship between two variables, a correlation coefficient of −1 indicates a perfect negative linear relationship, and a correlation coefficient of 0 means that there is no linear relationship between the variables; values close to +1 or −1 indicate strong linear relationships [21,22,23,24].
Table 2 The correlation coefficient for population growth versus number of DM patients in each region of Kazakhstan
The regions of Kostanay, North Kazakhstan, and East Kazakhstan were all found to show relatively strong negative correlations between population growth and the number of DM patients, i.e., strong negative linear relationships between these parameters, while the Akmola region showed a weak negative correlation. Positive correlations between the two variables are seen for the other regions of Kazakhstan.
The correlation between the growth in the gross regional product (GRP) and the growth in the number of diabetic patients in each region of the country was also analyzed; the corresponding correlation coefficients are shown in Table 3. GRP is a general indicator of the economic activity of the region, i.e., the amount of goods and services produced in that region [25].
Table 3 The correlation coefficient for growth in GRP versus growth in the number of patients with DM in each region of Kazakhstan
According to Table 3, there is a strong positive linear relationship between the growth in GRP and the growth in the number of patients with DM in each region of Kazakhstan.
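To illustrate how a coefficient of this kind is obtained, the following minimal sketch computes r for two short series with NumPy, assuming that the Pearson coefficient is meant. The numbers are made-up placeholders rather than registry data, and the variable and function names are assumptions introduced for the example.

```python
import numpy as np

# Illustrative placeholder values only (not data from the study).
population_growth = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
dm_patients = np.array([2.1, 3.9, 6.2, 8.1, 9.8])


def pearson_r(x, y):
    """Pearson r: covariance divided by the product of the standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)


r = pearson_r(population_growth, dm_patients)

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r.
r_numpy = float(np.corrcoef(population_growth, dm_patients)[0, 1])

print(f"r = {r:.4f}")
```

Both routes give the same value; in practice, the same calculation would be repeated for each region.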
Literature Review of Methods Used to Construct Predictive Models
The standard tool used in medical research (indeed, in all areas of research) to explore correlations between variables is regression analysis [26, 27]. For instance, the authors of [28] performed studies to detect anomalies in surveillance data and concluded that while the number of studies that use more sophisticated methods such as machine learning methods and hidden Markov models is increasing, studies that use traditional methods such as control charts and linear regression remain more popular.
In [29], predictive methods based on machine learning were compared with those based on traditional statistical methods. The empirical results of this comparison highlighted the need for objective and unbiased approaches to testing the performance of forecasting methods. This can be achieved by comparing the predictions afforded by the various forecasting methods when they are all applied to the same task, and by analyzing a large dataset (e.g., a large number of time series in the present work), as this should lead to fair and meaningful comparisons and definite conclusions.
The application of methods based on regression analysis to build predictive models will be successful if there is a known correlation between two variables of interest. Our correlation analysis revealed that there was a strong positive linear relationship between growth in GRP and growth in the number of patients with DM in Kazakhstan. The next step was to apply three types of regression analysis to predict the growth in the number of patients with diabetes mellitus based on passive detection: linear regression, polynomial regression, and exponential regression. If the value of one of the parameters considered is known to a high level of accuracy, we can use these three regression equations to determine the value of another parameter that is related to the first parameter [30].
Forecasting the Growth in the Number of DM Patients in Kazakhstan in 2019 Using Three Types of Regression Analysis
Regression methods are statistical methods for studying the distribution of a dependent variable in relation to one or more independent variables [31]. The aim of regression analysis is to build a mathematical model that allows the value of a dependent variable to be estimated from the values of independent variables [32]. Such a model incorporates regression coefficients that are identified by constructing a regression line, i.e., a line of best fit to the distribution of the dependent variable in relation to the independent variable(s). In the present work, we used various types of regression lines (linear, third-degree polynomial, and exponential) to achieve the best fit to the distribution. In each case, the best variant of the regression equation was chosen by identifying the variant with the highest coefficient of determination \( R^{2} \) [30]. Many methods of determining the parametric relationship between a dependent variable and independent variables have been developed. These methods usually differ in the shape of the function used in parametric regression and the distribution of the error term in the regression model. Examples include linear regression, logistic regression, and Poisson regression [33].
In the present work, we applied the three regression methods to a situation with one dependent and one independent variable using the machine-learning library scikit-learn of the programming language Python. Particular attention was paid to verifying that the conditions required for the appropriate application of the methods were present.
- 1.
In linear regression analysis, we estimate the parameters of a straight line that can be used to predict the value of one variable from the value of the other variable.
The straight line has the formula
$$ y = \beta_{0} + \beta_{1} x, $$
where y is the value of one of the variables, \( \beta_{0} \) is the point at which the straight line crosses the y-axis, \( \beta_{1} \) is the slope of the line, and x is the value of the other variable. Linear regression analysis is performed if correlation analysis reveals a relationship between the variables [24, 34]. The linear regression equation that was used as a model for predicting the number of DM patients took the following form:
$$ y = 15915x + 78368,\,{\text{with}}\,R^{2} = 0.9804. $$
- 2.
Polynomials are widely used in situations where a curvilinear response is observed. Even the most complex nonlinear relationships can be adequately modeled by polynomials across a fairly narrow range of x values.
A regression equation based on a third-degree polynomial takes the following form:
$$ y = ax^{3} + bx^{2} + cx + d, $$
where a, b, c, and d are the regression coefficients, and the maximum number of turning points (maxima and minima) and inflection points presented by the curve is determined by the degree of the polynomial [34, 35]. The polynomial regression equation that was used as a model for predicting the number of DM patients took the following form:
$$ y = - 38.378x^{3} + 1487.6x^{2} + 780.81x + 113349,\,{\text{with}}\,R^{2} = 0.9964. $$
- 3.
Exponential regression involves regression functions of the following form:
$$ y = a m^{x} = a \left( {\text{e}}^{\ln \left( m \right)} \right)^{x} = a {\text{e}}^{x \ln \left( m \right)} = a {\text{e}}^{bx} ,\,{\text{where}}\,b = \ln (m). $$
The exponential regression equation used as a model for predicting the number of DM patients took the following form [36]:
$$ y = 102666{\text{e}}^{0.0796x} ,\,{\text{with}}\,R^{2} = 0.995. $$
Once the regression equations had been calculated, they were used to predict the number of DM patients in each region of Kazakhstan in the year 2019; these data are presented in Table 4.
Table 4 The number of DM patients in each region of Kazakhstan in the year 2019, as predicted using three regression analysis methods
According to the predicted data for 2019 obtained using linear regression, there will be 333,010 DM patients in Kazakhstan. According to polynomial regression, there will be 350,074 DM patients in Kazakhstan in 2019, but, according to exponential regression, there will be 369,945 DM patients. After obtaining these data, the regression model plot shown in Fig. 2 was generated.
As shown in Fig. 2, all three types of regression had high coefficients of determination, i.e., \( R^{2} \) was always above 0.9, although polynomial regression yielded the highest \( R^{2} \) value. From this, it follows that the polynomial model is best suited for use as a model for predicting the number of DM patients.
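The three fitted equations reported in this section can be evaluated side by side, as in the sketch below. It assumes that x is a period index; the text does not state how x is encoded, so the value x = 16 is chosen purely for illustration. The function names are introduced for the example.

```python
import math


def linear_model(x):
    # y = 15915x + 78368 (linear fit reported in the text)
    return 15915 * x + 78368


def polynomial_model(x):
    # y = -38.378x^3 + 1487.6x^2 + 780.81x + 113349 (third-degree fit)
    return -38.378 * x ** 3 + 1487.6 * x ** 2 + 780.81 * x + 113349


def exponential_model(x):
    # y = 102666 * e^(0.0796x) (exponential fit)
    return 102666 * math.exp(0.0796 * x)


# Assumption: x is a period index; 16 is an illustrative value only.
x = 16
predictions = {
    "linear": linear_model(x),
    "polynomial": polynomial_model(x),
    "exponential": exponential_model(x),
}

for name, value in predictions.items():
    print(f"{name:12s} {value:12.0f}")
```

With this assumed index, the exponential model gives the largest value and the linear model the smallest, which is the same ordering as in Table 4. The exact figures may differ from the table, for example if the published coefficients are rounded.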
Relationship of Population Growth and GDP to the Growth in the Number of Patients with DM
A regression analysis was performed to determine the relationship of population growth and GDP to the growth in the number of diabetic patients in Kazakhstan. The model used took the following form:
$$ y = a + b_{1} x_{1} + b_{2} x_{2} + \varepsilon . $$
More precisely, it was found to be
$$ y = - 128438 + 14.47913x_{1} + 0.003268x_{2} + \varepsilon , $$
where y is the number of patients with DM, and \( R^{2} = 0.98 \) is the coefficient of determination, i.e., the proportion of the variance of the dependent variable that is explained by the model under consideration (by the explanatory variables). The coefficient of determination (\( 0 \le R^{2} \le 1 \)) is a measure of the quality of the regression model; i.e., how well it describes the relationship between the dependent and independent variables of the model. The closer the value of the coefficient of determination is to 1, the better the model fits the data. If \( R^{2} = 1 \), then the empirical points \( (x_{i}, y_{i}) \) lie exactly on the regression line and there is a linear functional relationship between variables Y and X. If \( R^{2} = 0 \), then all of the variation of the dependent variable is due to factors not taken into account in the model.
In the present research, the model relates the number of DM patients to population and GDP: x1 is the population, x2 is the GDP, and the constant a = −128,438. This is a function in which one variable depends on the values of two other variables, so that x1 and x2 together determine the value of y. We used the F test to assess the statistical significance of the coefficients. F was calculated to be 496.4881, indicating that the coefficients were jointly statistically significant. The value of a is the value of y that the model gives when x1 and x2 are both equal to zero. In the present research, a has a negative value, which we interpret as meaning that if there is no population growth and the economy expands, the number of patients with DM will decrease. b1 = 14.47913, which we interpret as showing that if the economy does not expand, the number of DM patients will increase by approximately 14,000 people as compared with the number in the previous year. b2 = 0.003268, which shows that if the population continues to grow at the same rate the following year and GDP increases by 1 million tenge, the number of DM patients will increase by this amount.
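A model of this two-variable form can be fitted, and its \( R^{2} \) and overall F statistic computed, along the lines of the sketch below. The data are synthetic placeholders generated from the equation above plus random noise, not the study data; the sample size, value ranges, and noise level are assumptions. The F statistic uses the standard overall-significance formula, on the assumption that this is the F test referred to in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 20  # hypothetical number of observations

# Synthetic placeholders, NOT the study data.
x1 = rng.uniform(15_000, 18_000, n)   # stand-in for population
x2 = rng.uniform(1e7, 6e7, n)         # stand-in for GDP
noise = rng.normal(0.0, 5_000.0, n)   # stand-in for the error term
y = -128438 + 14.47913 * x1 + 0.003268 * x2 + noise

# Fit y = a + b1*x1 + b2*x2 by ordinary least squares.
X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)
y_hat = model.predict(X)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares).
ss_res = float(np.sum((y - y_hat) ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

# Overall F statistic for k predictors and n observations.
k = X.shape[1]
f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))

print(f"a = {model.intercept_:.1f}, b1 = {model.coef_[0]:.4f}, b2 = {model.coef_[1]:.6f}")
print(f"R^2 = {r2:.4f}, F = {f_stat:.1f}")
```

The manually computed `r2` agrees with `model.score(X, y)`, which scikit-learn defines as the same quantity.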
Using the scikit-learn Library of Python to Model Regression Methods
Python, one of the most popular programming languages, can be used to implement machine learning algorithms [37]. Python has a well-documented library named scikit-learn [38] that can be employed for machine learning. The scikit-learn library can be applied to tasks such as clustering, cross-validation, correlation, dimensionality reduction, ensemble methods, feature extraction, feature selection, optimization of algorithm parameters, and manifold learning.
To demonstrate the use of this library, let us consider an example in which it is used to determine the number of patients with diabetes based on data obtained from the statistical data register of the Republic of Kazakhstan. The first step is to load the required data. The scikit-learn library is used to model data, not to load it. However, the Pandas library [39] can be used to load data, as it has convenient functions for I/O and the processing of tabular data, and this library can also perform primary data analysis.
Data loading code:
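A minimal sketch of such a loading step is given here. The column names (`year`, `patients`) and all values are hypothetical placeholders, and an in-memory string stands in for a file exported from the register so that the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for a file exported from the statistical register.
# Column names and values are hypothetical placeholders.
csv_text = """year,patients
1,100
2,118
3,131
4,150
5,166
"""

# pd.read_csv accepts a file path or any file-like object;
# a real run would pass the path of the exported file instead.
data = pd.read_csv(io.StringIO(csv_text))

print(data.head())      # first rows of the table
print(data.describe())  # summary statistics as a first look at the data
```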
The next step is to work with arrays, where X is an array of features and Y is an array of target values (class labels in classification tasks). The subsequent step is to normalize the features, since many machine-learning algorithms are based on gradient methods. After loading the necessary data, researchers can use the capabilities of machine-learning algorithms. The scikit-learn library implements a variety of algorithms, such as logistic regression, linear regression, naive Bayes, k-nearest neighbors, decision trees, and support vector machines. The scikit-learn library can easily be integrated into applications that perform traditional statistical data analysis, as well as other types of applications, as it relies on the Python scientific software ecosystem. Algorithms implemented in a high-level language can be used as building blocks in a range of applications, such as in medical imaging [40].
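The array preparation and feature normalization steps might look as follows. This is a sketch under assumptions: the values are placeholders, and `StandardScaler` is one of several scalers that scikit-learn provides; the source does not say which was used.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical placeholder data: one feature (a period index) and one target.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # shape (n_samples, n_features)
y = np.array([100.0, 118.0, 131.0, 150.0, 166.0])  # shape (n_samples,)

# Rescale each feature column to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.ravel())
```

scikit-learn estimators expect X to be two-dimensional even when there is a single feature, which is why each value sits in its own row.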
The first experiment, which used a linear regression algorithm, employed the following code:
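A minimal sketch of a linear regression experiment with scikit-learn is shown here; it is not the original listing. The data are synthetic points lying exactly on y = 3x + 2, chosen so that the fitted slope and intercept can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data on the line y = 3x + 2 (not the study data).
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + 2.0

# Fit the straight line y = b0 + b1 * x.
model = LinearRegression()
model.fit(X, y)

slope = float(model.coef_[0])
intercept = float(model.intercept_)
r2 = model.score(X, y)  # coefficient of determination on the fitted data

# Predict the value for the next, unseen period.
next_value = float(model.predict(np.array([[11.0]]))[0])

print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.4f}")
print(f"prediction for x = 11: {next_value:.1f}")
```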
After training the model, it is easy to predict the number of patients according to the input attribute using the predict method. In the second experiment, the following polynomial regression algorithm was used:
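One common way to fit a third-degree polynomial with scikit-learn is to expand the feature into powers with `PolynomialFeatures` and then apply ordinary linear regression. The sketch below follows that approach on a synthetic cubic; it is an assumption about how such an experiment could be set up, not the original listing.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic placeholder data on a known cubic (not the study data).
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
x = X.ravel()
y = 0.5 * x ** 3 - 2.0 * x ** 2 + x + 4.0

# Expand x into [x, x^2, x^3], then fit a linear model on those columns.
poly_model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)
poly_model.fit(X, y)

r2 = poly_model.score(X, y)
pred = float(poly_model.predict(np.array([[11.0]]))[0])

print(f"R^2 = {r2:.4f}, prediction for x = 11: {pred:.1f}")
```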
Finally, a third experiment that applied the following exponential regression algorithm was carried out:
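scikit-learn does not provide a dedicated exponential regression estimator, so the sketch below uses a common workaround: taking logarithms turns y = a·e^{bx} into ln y = ln a + bx, which can be fitted with ordinary linear regression. Whether the original experiment used this approach is not stated, so it should be read as an assumption; the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic placeholder data on y = 5 * e^(0.2x) (not the study data).
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 5.0 * np.exp(0.2 * X.ravel())

# Fit ln(y) = ln(a) + b * x; this requires all y values to be positive.
log_model = LinearRegression()
log_model.fit(X, np.log(y))

a = float(np.exp(log_model.intercept_))  # back-transform the intercept
b = float(log_model.coef_[0])

# Predict on the original scale for the next period.
pred = float(a * np.exp(b * 11.0))

print(f"y = {a:.3f} * exp({b:.4f}x), prediction for x = 11: {pred:.2f}")
```

Fitting in log space minimizes squared errors of ln y rather than of y, so the result can differ from a direct nonlinear least-squares fit when the data are noisy.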