Introduction

The increasing incidence of diabetes worldwide is of great concern and has attracted the attention of many researchers [1, 2]. All types of diabetes pose a high risk of premature death and are a serious problem. Official statistics published by the World Health Organization indicate that 422 million people [3] were suffering from this disease in 2014; this number is forecast to rise to more than 690 million people in 2045 [4]. In Kazakhstan alone, the number of diabetic patients is believed to exceed 300,000, and this figure only includes patients who were directly diagnosed by doctors [5]. There are serious problems in this country due to a lack of qualified specialists in diabetes, meaning that diabetes is often first treated during the advanced rather than the early stages of the disease. These shortcomings have led to an increase in diabetic patients in Kazakhstan, and diabetes mellitus (DM) is currently the fourth most prevalent disease in the country [6]. DM is thus a growing public health problem that affects not only human health but also the health care system overall and, indeed, the global economy [7].

The steady growth in the number of diabetes patients has prompted researchers around the world to explore methods permitting the prediction and early diagnosis of diabetes. For instance, the prevalence of chronic kidney disease in European patients with diabetes until 2025 was predicted in [8], and a demographic epidemiological model of Singapore was devised in [9] and then used to predict the overall prevalence of type 2 diabetes in Singapore until 2050. While measures to tackle the rising prevalence of diabetes can be, and are being, taken at the state level, multisectoral efforts to treat this disease are needed to optimize socioeconomic productivity. The authors of [10] argue that the pandemic of diabetes cannot be solved without the participation of all of the stakeholders concerned, including diabetic patients and the community.

A model for predicting type 2 diabetes based on data-mining methods was proposed in [11]. This model consisted of an improved k-means algorithm and a logistic regression algorithm. Other researchers have developed a model for predicting the prevalence and incidence of obesity and diabetes as well as the direct costs of treating diabetes and its complications [12]. In another study, a population-based analysis of an elderly cohort was carried out to investigate whether oral antidiabetic agent use can decrease the risk of dementia in type 2 diabetes patients, and the correlation of the incidence of dementia with the duration of diabetes was explored [13]. In Taiwan, models for estimating the diabetes-associated risk of hospitalization and the risk of type 2 diabetes inpatient mortality were devised in order to facilitate the identification of at-risk patients [14]. Three forecasting machines were used in [15] to anticipate the glycemic impacts of various meals: a data assimilation machine, a model averaging the data assimilation results, and a machine utilizing dynamic Gaussian process model regression. The forecasted glycemic impacts were found to correlate well with glucose indicators, and the prediction accuracy of the technique was as good as or better than expected. Computer simulations can also help researchers to understand chronic disease progression. In this context, the authors of [16] developed and used system dynamics to perform diabetes system modeling.

An important direction in the development of medical services for populations is the construction and implementation of various problem-oriented information systems that can utilize all of the heterogeneous information collected during the diagnosis and treatment of patients with diabetes and apply “big data” technology and cloud services as a toolkit. There is a high demand from modern medical institutions for such systems that use the latest information technologies to facilitate the diagnosis and treatment of diabetes mellitus [17,18,19,20].

The purpose of the study described below was to identify the most effective regression analysis method for predicting the growth in the number of patients with diabetes in Kazakhstan using passive detection and real statistical data on such patients.

Methods

Statistical Analysis

Data on diabetic patients in Kazakhstan were provided by a public foundation, the Kazakh Society for the Study of Diabetes [5], which provided informed consent for the publication of all statistical data used in this study.

Data on gross domestic product (GDP) and the population of Kazakhstan were taken from the official website of the Statistical Agency of the Republic of Kazakhstan [21]. The aims of this study were to build a model that could predict the growth in the number of diabetes patients in Kazakhstan via passive detection and regression analysis methods, and to identify the most accurate experimental method for predicting diabetes. Data on patients with diabetes in Kazakhstan from 2004 to 2018 (see Table 1) were used.

Table 1 The total number of patients with diabetes mellitus (DM) in Kazakhstan during each of the last 15 years

From 2004 to 2018, the number of patients with diabetes increased from approximately 114,000 to approximately 326,000 (185.46%), as shown in Fig. 1.

Fig. 1 Data from the register of patients with DM in Kazakhstan

Based on this graph, we can conclude that there is a positive growth trend in the number of diabetic patients in Kazakhstan. The largest jump in the number of diabetic patients was observed in 2014—an increase of 35,251 people (15.58%).

Results

Correlating Population Growth with the Number of DM Patients and the Increase in the Gross Regional Product with the Growth in the Number of DM Patients in Kazakhstan

It is often necessary to explore the relationship between continuous variables. This can be probed using correlation analysis, as illustrated by Table 2, which presents the correlation between population growth and the number of DM patients in each region of Kazakhstan. The final result of correlation analysis is a correlation coefficient (r), the value of which can range from −1 to +1. A correlation coefficient of +1 indicates a perfect positive linear relationship between two variables, a coefficient of −1 indicates a perfect negative linear relationship, and a coefficient of 0 means that there is no linear relationship between the variables; the closer the value is to ±1, the stronger the linear relationship [21,22,23,24].
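The calculation behind such a coefficient can be sketched as follows, assuming that the Pearson coefficient is the one intended. The function name `pearson_r` and the short series are hypothetical placeholders chosen only to show one positive and one negative case; they are not the data behind Table 2.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up illustrative series (placeholders, not register data)
population = [1.0, 2.0, 3.0, 4.0, 5.0]
patients_up = [2.0, 4.1, 5.9, 8.2, 9.9]    # rises with population
patients_down = [9.9, 8.2, 5.9, 4.1, 2.0]  # falls as population rises

r_up = pearson_r(population, patients_up)
r_down = pearson_r(population, patients_down)
print(round(r_up, 3), round(r_down, 3))
```

A value of r close to +1 or −1, as in these two placeholder series, corresponds to the strong positive and strong negative relationships discussed for the individual regions.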

Table 2 The correlation coefficient for population growth versus number of DM patients in each region of Kazakhstan

The regions of Kostanay, North Kazakhstan, and East Kazakhstan were all found to show relatively strong negative correlations between population growth and the number of DM patients, i.e., strong negative linear relationships between these parameters, while the Akmola region showed only a weak negative correlation. Positive correlations between the two variables are seen for the other regions of Kazakhstan.

The correlation between the growth in the gross regional product (GRP) and the growth in the number of diabetic patients in each region of the country was also analyzed; the corresponding correlation coefficients are shown in Table 3. GRP is a general indicator of the economic activity of the region, i.e., the amount of goods and services produced in that region [25].

Table 3 The correlation coefficient for growth in GRP versus growth in the number of patients with DM in each region of Kazakhstan

According to Table 3, there is a strong positive linear relationship between the growth in GRP and the growth in the number of patients with DM in each region of Kazakhstan.

Literature Review of Methods Used to Construct Predictive Models

The standard tool used in medical research (indeed, in all areas of research) to explore correlations between variables is regression analysis [26, 27]. For instance, the authors of [28] performed studies to detect anomalies in surveillance data and concluded that while the number of studies that use more sophisticated methods such as machine learning methods and hidden Markov models is increasing, studies that use traditional methods such as control charts and linear regression remain more popular.

In [29], predictive methods based on machine learning were compared with those based on traditional statistical methods. The empirical results of this comparison highlighted the need for objective and unbiased approaches to testing the performance of forecasting methods. This can be achieved by comparing the predictions afforded by the various forecasting methods when they are all applied to the same task, and by analyzing a large dataset (e.g., a large number of time series in the present work), as this should lead to fair and meaningful comparisons and definite conclusions.

The application of methods based on regression analysis to build predictive models will be successful if there is a known correlation between two variables of interest. Our correlation analysis revealed that there was a strong positive linear relationship between growth in GRP and growth in the number of patients with DM in Kazakhstan. The next step was to apply three types of regression analysis to predict the growth in the number of patients with diabetes mellitus based on passive detection: linear regression, polynomial regression, and exponential regression. If the value of one of the parameters considered is known to a high level of accuracy, we can use these three regression equations to determine the value of another parameter that is related to the first parameter [30].

Forecasting the Growth in the Number of DM Patients in Kazakhstan in 2019 Using Three Types of Regression Analysis

Regression methods are statistical methods for studying the distribution of a dependent variable in relation to one or more independent variables [31]. The aim of regression analysis is to build a mathematical model that allows the value of a dependent variable to be estimated from the values of independent variables [32]. Such a model incorporates regression coefficients that are identified by constructing a regression line, i.e., a line of best fit to the distribution of the dependent variable in relation to the independent variable(s). In the present work, we used various types of regression lines (linear, third-degree polynomial, and exponential) to achieve the best fit to the distribution. In each case, the best variant of the regression equation was chosen by identifying the variant with the highest coefficient of determination \( R^{2} \) [30]. Many methods of determining the parametric relationship between a dependent variable and independent variables have been developed. These methods usually differ in the shape of the function used in parametric regression and the distribution of the error term in the regression model. Examples include linear regression, logistic regression, and Poisson regression [33].

In the present work, we applied the three regression methods to a situation with one dependent and one independent variable using the machine-learning library scikit-learn of the programming language Python. Particular attention was paid to verifying that the conditions required for the appropriate application of the methods were present.

1. In linear regression analysis, the parameters of a straight line that can be used to predict the value of one variable from the value of the other variable are estimated.

The straight line has the formula

$$ y = \beta_{0} + \beta_{1} x, $$

where y is the value of one of the variables, \( \beta_{0} \) is the point at which the straight line crosses the y-axis, \( \beta_{1} \) is the slope of the line, and x is the value of the other variable. Linear regression analysis is performed if correlation analysis reveals a relationship between the variables [24, 34]. The linear regression equation that was used as a model for predicting the number of DM patients took the following form:

$$ y = 15915x + 78368,\quad {\text{with}}\ R^{2} = 0.9804. $$

2. Polynomials are widely used in situations where a curvilinear response is observed. Even the most complex nonlinear relationships can be adequately modeled by polynomials across a fairly narrow range of x values.

A regression equation based on a third-degree polynomial takes the following form:

$$ y = ax^{3} + bx^{2} + cx + d, $$

where the degree of the polynomial determines the maximum number of turning points (maxima and minima) and inflection points that the curve can have [34, 35]. The polynomial regression equation that was used as a model for predicting the number of DM patients took the following form:

$$ y = -38.378x^{3} + 1487.6x^{2} + 780.81x + 113349,\quad {\text{with}}\ R^{2} = 0.9964. $$

3. Exponential regression involves regression functions of the following form:

$$ y = a \cdot m^{x} = a \cdot \left( {\text{e}}^{\ln (m)} \right)^{x} = a \cdot {\text{e}}^{x\ln (m)} = a \cdot {\text{e}}^{bx},\quad {\text{where}}\ b = \ln (m). $$

The exponential regression equation used as a model for predicting the number of DM patients took the following form [36]:

$$ y = 102666\,{\text{e}}^{0.0796x},\quad {\text{with}}\ R^{2} = 0.995. $$

Once the regression equations had been calculated, they were used to predict the number of DM patients in each region of Kazakhstan in the year 2019; these data are presented in Table 4.

Table 4 The number of DM patients in each region of Kazakhstan in the year 2019, as predicted using three regression analysis methods

According to linear regression, there will be 333,010 DM patients in Kazakhstan in 2019. Polynomial regression predicts 350,074 DM patients in 2019, whereas exponential regression predicts 369,945. These predictions were then used to generate the regression model plot shown in Fig. 2.
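As a rough check, the three fitted equations can be evaluated directly. The sketch below assumes that x is a year index with x = 1 for 2004, so that 2019 corresponds to x = 16; this indexing is an assumption and is not stated explicitly in the text. Because the published coefficients are rounded, the linear value should come out within a few patients of the reported total, and the polynomial and exponential values within roughly one percent.

```python
import math


def linear(x):
    # y = 15915x + 78368
    return 15915 * x + 78368


def cubic(x):
    # y = -38.378x^3 + 1487.6x^2 + 780.81x + 113349
    return -38.378 * x**3 + 1487.6 * x**2 + 780.81 * x + 113349


def exponential(x):
    # y = 102666 e^{0.0796x}
    return 102666 * math.exp(0.0796 * x)


# Assumption: x = 1 corresponds to 2004, so 2019 is x = 16
x_2019 = 2019 - 2004 + 1

for name, model in [("linear", linear), ("cubic", cubic), ("exponential", exponential)]:
    print(f"{name:12s} {model(x_2019):,.0f}")
```

The spread between the three values widens as x moves beyond the fitted range, which is why the choice of regression form matters more for the forecast than for the fit to the 2004 to 2018 data.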

Fig. 2 Plot showing the number of DM patients in Kazakhstan each year from 2004 to 2018, as well as three different regression lines fitted to the data. The regression lines were used to predict the number of DM patients in Kazakhstan in 2019

As shown in Fig. 2, all three types of regression had high coefficients of determination, i.e., \( R^{2} \) was always above 0.9, although polynomial regression yielded the highest \( R^{2} \) value. From this, it follows that the polynomial model is best suited for use as a model for predicting the number of DM patients.

Relationship of Population Growth and GDP to the Growth in the Number of Patients with DM

A regression analysis was performed to determine the relationship of population growth and GDP to the growth in the number of diabetic patients in Kazakhstan. The model used took the following form:

$$ y = a + b_{1} x_{1} + b_{2} x_{2} + \varepsilon . $$

More precisely, it was found to be

$$ y = -128438 + 14.47913x_{1} + 0.003268x_{2} + \varepsilon , $$

where y is the number of patients with DM, and \( R^{2} = 0.98 \) is the coefficient of determination, i.e., the proportion of the variance of the dependent variable that is explained by the model (that is, by the explanatory variables). The coefficient of determination (\( 0 \le R^{2} \le 1 \)) is a measure of the quality of the regression model, i.e., of how well it describes the relationship between the dependent and independent variables. The closer the value of \( R^{2} \) is to 1, the better the model. If \( R^{2} = 1 \), then the empirical points \( (x_{i}, y_{i}) \) lie exactly on the regression line and there is a linear functional relationship between the variables Y and X. If \( R^{2} = 0 \), then all of the variation of the dependent variable is due to factors not taken into account in the model.
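The two limiting cases of the coefficient of determination can be illustrated with a minimal sketch that uses the standard definition, one minus the ratio of the residual sum of squares to the total sum of squares. The helper name `r_squared` and the four observations are hypothetical and serve only as an illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Placeholder observations (not register data)
y_obs = [3.0, 5.0, 7.0, 9.0]

perfect_fit = [3.0, 5.0, 7.0, 9.0]  # predictions lie exactly on the data
mean_only = [6.0, 6.0, 6.0, 6.0]    # model that always predicts the mean

print(r_squared(y_obs, perfect_fit))  # 1.0
print(r_squared(y_obs, mean_only))    # 0.0
```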

In the present research, the model relates the number of DM patients to the population and GDP: \( x_{1} \) is the population, \( x_{2} \) is the GDP, and the constant is a = −128,438. This is an example of a function in which one variable depends on the values of two other variables, with \( x_{1} \) and \( x_{2} \) together determining the value of y. We used the F test to assess the statistical significance of the regression. F was calculated to be 496.4881, meaning that the coefficients were jointly statistically significant. The value of a is the value of y that the model predicts when \( x_{1} \) and \( x_{2} \) are both equal to zero; because a is negative here, it has no direct practical meaning and simply fixes the position of the regression plane. \( b_{1} \) = 14.47913, which shows that, if GDP does not change, the predicted number of DM patients increases by approximately 14.48 for each unit increase in \( x_{1} \). \( b_{2} \) = 0.003268, which shows that, if the population does not change and GDP increases by 1 million tenge, the predicted number of DM patients increases by 0.003268.
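A two-predictor fit of this kind, together with the overall F statistic, can be sketched with scikit-learn as follows. The data are synthetic and generated only for illustration (they are not the population, GDP, or patient series used in the study), and the F statistic is computed from the standard relation \( F = (R^{2}/k)/((1 - R^{2})/(n - k - 1)) \) for a model with k predictors and n observations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 15  # same number of yearly observations as 2004 to 2018

# Synthetic placeholder predictors and response (not the study data)
x1 = np.linspace(15.0, 18.0, n)
x2 = np.linspace(10.0, 60.0, n) + rng.normal(0.0, 3.0, n)
y = -50.0 + 8.0 * x1 + 0.5 * x2 + rng.normal(0.0, 0.5, n)

X = np.column_stack([x1, x2])
model = LinearRegression().fit(X, y)

r2 = model.score(X, y)  # coefficient of determination of the fit
k = X.shape[1]          # number of predictors
f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))

print("a  =", model.intercept_)
print("b1, b2 =", model.coef_)
print("R^2 =", r2, " F =", f_stat)
```

Each fitted coefficient is read with the other predictor held fixed, which is the interpretation applied to \( b_{1} \) and \( b_{2} \) above.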

Using the scikit-learn Library of Python to Model Regression Methods

One of the most popular programming languages, Python, can be used to implement machine learning algorithms [37]. Python has a well-documented library named scikit-learn [38] that can be employed for machine learning. The scikit-learn library can be applied to tasks such as clustering, cross-validation, correlation, dimension reduction, algorithmic compositions, feature extraction, feature selection, optimization of algorithm parameters, and multiple learning.

To demonstrate the utilization of this library, let us consider an example in which it is used to determine the number of patients with diabetes based on data obtained from the statistical data register of the Republic of Kazakhstan. The first step is to load the required data. The scikit-learn library is used to model data, not to load it. However, the Pandas library [39] can be used to load data, as it has convenient functions for I/O and for processing tabular data; it can also perform primary data analysis.

Data loading code:

figure a
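A minimal sketch of such a loading step is given here, assuming that the register is available as a CSV table. The column names `year` and `patients` and the three rows are hypothetical placeholders, and an in-memory string stands in for the file so that the example is self-contained.

```python
import io

import pandas as pd

# Stand-in for a CSV file on disk; names and values are placeholders,
# not the register data.
csv_text = """year,patients
2004,100
2005,110
2006,125
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first rows of the table
print(df.describe())  # basic descriptive statistics (primary analysis)
```

With a real file, the `io.StringIO` object would simply be replaced by the path to the CSV file.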

The next step is to work with arrays, where X is an array of features and Y is an array of target values (class labels in classification tasks). The subsequent step is to normalize the features, since many machine-learning algorithms are based on gradient methods and are sensitive to differences in feature scale. After loading the necessary data, researchers can use the capabilities of machine-learning algorithms. The scikit-learn library implements a variety of algorithms, such as logistic regression, linear regression, naive Bayes, k-nearest neighbors, decision trees, and support vector machines. The scikit-learn library can easily be integrated into applications that perform traditional statistical data analysis, as well as other types of applications, as it relies on the Python scientific software ecosystem. Algorithms implemented in a high-level language can be used as building blocks in a range of applications, such as in medical imaging [40].
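The normalization step can be sketched with `StandardScaler` from scikit-learn, which is one common choice and not necessarily the one used in the study. The feature matrix is a hypothetical placeholder with two columns on very different scales.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # each column: zero mean, unit variance

print(X_scaled)
```

After scaling, both columns contribute on a comparable scale, which matters most for gradient-based and distance-based algorithms.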

The first experiment, which used a linear regression algorithm, employed the following code:

figure b
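A minimal sketch of a linear regression fit with scikit-learn might look as follows. The year index and patient counts are placeholders that follow y = 10x + 100 exactly, so the fitted coefficients are easy to check; they are not the register data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data following y = 10x + 100 exactly (not register data)
x = np.arange(1, 6).reshape(-1, 1)  # year index as a single feature column
y = np.array([110.0, 120.0, 130.0, 140.0, 150.0])

model = LinearRegression().fit(x, y)

print("slope:", model.coef_[0])
print("intercept:", model.intercept_)
print("R^2:", model.score(x, y))

# Forecast for the next year index
next_value = model.predict(np.array([[6]]))[0]
print("forecast for x = 6:", next_value)
```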

After training the model, it is easy to predict the number of patients according to the input attribute using the predict method. In the second experiment, the following polynomial regression algorithm was used:

figure c
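One way to fit a third-degree polynomial with scikit-learn is to combine `PolynomialFeatures` with `LinearRegression` in a pipeline, as sketched below. This is an assumption about how such a fit can be set up rather than a reproduction of the original listing, and the data are placeholders generated from an exact cubic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Placeholder data generated from an exact cubic (not register data)
x = np.arange(1, 9, dtype=float).reshape(-1, 1)
y = (0.5 * x**3 - 2.0 * x**2 + 3.0 * x + 10.0).ravel()

# Expand x into [x, x^2, x^3], then fit an ordinary linear model on them
model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)
model.fit(x, y)

pred = model.predict(np.array([[9.0]]))[0]
print("R^2:", model.score(x, y))
print("forecast for x = 9:", pred)
```

Because the polynomial terms are treated as ordinary features, the fit remains a linear least-squares problem in the coefficients a, b, c, and d.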

Finally, a third experiment that applied the following exponential regression algorithm was carried out:

figure d
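The scikit-learn library has no dedicated exponential regression estimator, so one common approach, assumed here, is to fit a straight line to ln(y) and transform the coefficients back, using \( y = a{\text{e}}^{bx} \Leftrightarrow \ln y = \ln a + bx \). The data are placeholders generated from \( y = 100{\text{e}}^{0.08x} \).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data following y = 100 * e^(0.08 x) exactly (not register data)
x = np.arange(1, 8, dtype=float).reshape(-1, 1)
y = 100.0 * np.exp(0.08 * x.ravel())

# Fit ln(y) = ln(a) + b * x with ordinary linear regression
log_model = LinearRegression().fit(x, np.log(y))

a = np.exp(log_model.intercept_)  # back-transform the intercept
b = log_model.coef_[0]

pred = a * np.exp(b * 8.0)        # forecast for the next index, x = 8
print("a =", a, " b =", b, " forecast =", pred)
```

A fit of this kind minimizes the error in ln(y) rather than in y, so its coefficients can differ slightly from those produced by tools that fit the exponential curve in another way.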

Discussion

The authors examined the main features of the scikit-learn library that were used to solve machine-learning problems. Results obtained from Python were then compared with those obtained using Excel. In comparison, it was found that all three regression methods yielded similar predicted values regardless of whether Python or Excel was used, although there was a difference of 16 patients between the results obtained with exponential regression using Python (369,961 patients) and Excel (369,945 patients). Since the values predicted using the different regression analysis methods and Excel or Python are rather similar to each other and are quite close to the actual DM populations reported for Kazakhstan in recent years, and given the approximate nature of these forecasting techniques, it appears that it is feasible to use regression analysis methods to accurately predict the DM population in Kazakhstan.

The regression model would be more reliable if the statistical data for DM patients in Kazakhstan were obtained on a monthly basis rather than an annual basis. This is one limitation of this study.

Conclusion

In this work, we reviewed many studies that used regression analysis for forecasting purposes. This review led us to conclude that regression analysis methods are effective techniques for solving various problems, including many in the field of medicine. We therefore tested three different regression models as possible tools for predicting the number of patients with diabetes in Kazakhstan in 2019. All of the models indicated that the number of DM patients will increase, which is concerning. Strong correlations of population growth and GDP with the growth in the number of patients with diabetes in Kazakhstan were observed, and the relationship between population growth and the number of diabetics was determined. A correlation between the growth in GRP and the growth in the number of patients with diabetes was also discerned. The main features of the scikit-learn library that can be applied to machine learning problems were considered in the context of predicting the number of patients with diabetes in Kazakhstan using regression analysis methods. This research is part of a larger research project that focuses not only on the prediction of the number of diabetic patients in 2019 but also on the diagnosis and study of diabetes using big data technologies. Although some results of this research have been studied and published, investigations into the use of big data technology in the health sector are ongoing.