1 Introduction

Diabetes, also known as diabetes mellitus, has become a significant public health problem worldwide, particularly in low- and middle-income countries, where about 80% of people with diabetes live according to the International Diabetes Federation (IDF), Belgium [1, 2]. The disease is a group of metabolic disorders characterised by high blood sugar. It occurs when the body cannot properly process the food we eat, which raises the blood sugar level [3]. Diabetes can lead to many serious long-term complications, including blindness, stroke, heart attack, kidney failure, and lower limb amputation. Diabetes and cardiovascular disease (CVD) are two of the most prevalent chronic diseases leading to death in the United States [4]. In 2015, approximately 9% of the United States population had been diagnosed with diabetes, while an additional 3% were undiagnosed [5]. Moreover, nearly 34% had prediabetes [6], and around 90% of adults with prediabetes were unaware of their condition. The number of people with diabetes increased from 122 million to 422 million between 1980 and 2014 [7] and is estimated to reach approximately 642 million in 2040 [8]. In 2016, an estimated 1.6 million deaths were caused by diabetes, while 2.2 million deaths were attributable to high blood glucose in 2012 [9]. Moreover, the prevalence of diabetes in Kuwait, Bahrain, Jordan and the United Arab Emirates was 12.8%, 14.9%, 17.1%, and 20.1%, respectively [10, 11]. Like other countries, Bangladesh is highly affected by diabetes. The IDF Diabetes Atlas, 5th edition, projected that diabetes prevalence in Bangladesh would increase by more than half within the next 15 years, making Bangladesh the country with the eighth-highest number of people with diabetes in the world [12].
WHO ranked diabetes as the seventh leading cause of death in 2016 [9]. Young adults are also strongly affected by diabetes. According to another statistical report, the global prevalence of diabetes among adults nearly doubled, from 4.7% to 8.5%, between 1980 and 2014 [12]. The number of people with diabetes is therefore growing markedly every day, and the number of deaths is rising with it, which is alarming.

Diabetes can be divided into three categories: Type I, Type II and gestational diabetes. Type I diabetes occurs when the immune system attacks the insulin-producing beta cells [13]. Because about 90% of the beta cells are permanently damaged by this attack, the pancreas cannot produce enough insulin. For this reason, it is also called insulin-dependent diabetes. Only 5–10% of all diabetes patients have Type I diabetes. It is also called juvenile-onset diabetes, as it may occur early in life. Type II diabetes is called non-insulin-dependent diabetes [14]. Here the body does not respond properly to the insulin produced by the pancreas, that is, it becomes resistant to insulin. This type is also known as adult-onset diabetes, although it affects both young and older people. Approximately 90% of people with diabetes have Type II diabetes. The last type is gestational diabetes, which generally occurs in women during pregnancy [15]. However, it affects the baby’s growth rather than the mother. Between 2% and 10% of women may develop gestational diabetes during pregnancy [16]. The disease generally goes away after pregnancy, but it may also lead to Type II diabetes.

The analysis of diabetes data is difficult because a large portion of clinical data is nonlinear, non-normal, correlation-structured, and complex [17]. Moreover, traditional clinical methods for detecting diabetes are time-consuming and inconvenient. Healthcare professionals therefore need time-efficient, simple and accessible diabetes detection systems that can precisely identify a patient’s stage of diabetes. In a world of constantly growing data, where hospitals are gradually embracing big data systems [18], advanced technology offers considerable advantages to the medical sector: it can provide insights, augment diagnosis, improve outcomes, and reduce costs [19, 20].

Machine learning has a significant impact on healthcare and enhances the efficiency of the health care system [21]. A considerable amount of research has been done on predicting or identifying diabetes with machine learning. To detect diabetes mellitus, Maniruzzaman et al. [22] applied four classifiers and used logistic regression (LR) to identify the most significant risk factors. Their model achieved 94.25% accuracy with a combination of LR-based feature selection and a Random Forest (RF) classifier. Dinh et al. [23] used 123 variables and the eXtreme Gradient Boosting (XGBoost) model, achieving 86.2% without laboratory data and 95.7% with laboratory data. In addition to machine learning, deep learning algorithms have also been used to detect diabetes [24]; the authors combined a convolutional neural network (CNN) with long short-term memory (LSTM) and obtained an accuracy of 95.7%. However, these approaches may produce inaccurate and biased results if the algorithms are applied to a dataset that has not been analysed properly [25]. We therefore need to analyse and adjust our data to suit the model. Statistical analysis helps us to do so and provides direction towards better results in diabetes prediction.

Thus, the purpose of applying a statistical approach in this study is to obtain the significant features that may lead to improved accuracy in detecting diabetes. This research aims to identify the most significant factors so that people with diabetes can be diagnosed accurately. Building on the research discussed above, we aim to reveal the impact of various factors on diabetes by performing a quantitative analysis. We pursue the following research objectives:

  • We investigate both demographic and clinical characteristics to explain the occurrence of diabetes.

  • We analyse and determine the most significant risk factors of diabetes using logistic regression.

The remainder of the paper is organised as follows. Section 2 presents related work. The methodology is presented in Sect. 3. Results and discussion are presented in Sect. 4. Finally, Section 5 concludes the paper.

2 Related Work

Diabetes is a chronic disease in which a person has an elevated blood glucose level, which may cause many complications. Over the last few decades, several researchers have adopted machine learning (ML) techniques in healthcare analysis, especially for cancer, brain tumour, and diabetes detection. The primary motivation for using ML models is to improve healthcare systems [26,27,28]. For example, Sneha [29] proposed an ML model that focuses on attribute selection to help detect diabetes mellitus through predictive analysis; the highest accuracy, 98.20%, was obtained with a combination of the decision tree algorithm and Random Forest. Sisodia [30] designed a model that forecasts the likelihood of diabetes in patients with the highest possible accuracy. In contrast, Ozcift [31] proposed an ensemble approach named “Rotation forest”, which combines 30 machine learning methods. Han et al. [32] proposed another ML-based model, which modifies the SVM prediction rules. Existing ML work is listed in Table 1.

Table 1 State of the art of diabetes detection models

Although existing studies present various ML-based models for diagnosing diabetes and successfully demonstrate and evaluate their performance, none of them achieved 100% accuracy in diabetes prediction. One of the main reasons may be that the dataset was not adequately analysed. Introducing statistical analysis into diabetes prediction might reduce bias and improve the results. For example, Afroz et al. [40] assessed diabetic patients in Bangladesh. However, the authors applied only the odds ratio to determine the factors causing diabetes. Camara et al. [41] evaluated the causes of diabetes in Cameroon and Guinea. Nevertheless, the authors applied only a non-adjusted odds ratio to find the significant factors that cause diabetes. Our study therefore examines a diabetes dataset with the goal of identifying the significant factors that cause diabetes by applying logistic regression and computing both the odds ratio and the adjusted odds ratio of each factor. Additionally, we calculate p-values to select critical features, improve detection accuracy, and determine the relationship between the predictors and the outcome class. In the last stage, we determine the importance of the factors using an Artificial Neural Network (ANN).

3 Research Methodology

Diabetes is a major metabolic disorder that adversely affects the entire body system [42]. Medical datasets are often high-dimensional, with complex and redundant features, which increases the likelihood of noise and dependency among the features and reduces performance and accuracy. Hence, data preprocessing plays a significant role in machine learning tasks on medical datasets [30, 43]. Dimensionality can be reduced by either feature selection or feature extraction. This work aims to identify the features that are significant risk factors for diabetes.

In this study, we used a freely available diabetes dataset from Bangladesh. After data collection, we present our model for investigating the effect of selected factors on diabetes. The process flow of the proposed framework is shown in Fig. 1. The framework is divided into three segments: Data Preprocessing, Data Analysis and Outcome. In the data preprocessing segment, we remove null values and unevenly distributed factors. In the next segment, we analyse the data based on demographic and clinical characteristics and then calculate each factor’s odds ratio and adjusted odds ratio using logistic regression. In the final segment, we identify the significant factors that have an impact on diabetes and determine the importance of the factors using the ANN.

Fig. 1 Overview of the proposed statistical model

3.1 Data Collection and Processing

The diabetes dataset used in this study was collected from the Bangladesh Demographic and Health Survey (BDHS), which is available on The DHS Program website [44]. The data consist of 1564 participants with nominal and ordinal factors. In this dataset, two patients have an arm circumference of zero cm, three have a systolic blood pressure of zero, two have a diastolic blood pressure of zero, and two have a weight of zero kg. These zero values represent missing or null values, so we removed the affected records in the preprocessing step. Our final dataset therefore contains 1555 participants: 132 diabetic patients and 1423 controls. The descriptions of the attributes and a brief statistical summary are shown in Tables 2 and 3, respectively.
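The zero-value filtering described above can be sketched in a few lines of Python. This is a minimal illustration only: the field names and the toy records are hypothetical (the actual BDHS variable names are not given here), and it assumes each participant is stored as one dictionary.

```python
# Hypothetical field names; the real BDHS variable names are not given in the text.
MEASUREMENT_FIELDS = ("arm_circumference_cm", "sbp", "dbp", "weight_kg")


def drop_zero_measurements(records, fields=MEASUREMENT_FIELDS):
    """Return only the records whose physical measurements are all non-zero.

    A zero in any of these fields is physiologically impossible, so it is
    treated as a missing value and the whole record is discarded.
    """
    return [r for r in records if all(r[f] != 0 for f in fields)]


# Toy records, invented for illustration (not taken from the BDHS data).
sample = [
    {"arm_circumference_cm": 27.0, "sbp": 120, "dbp": 80, "weight_kg": 61.5, "diabetic": 0},
    {"arm_circumference_cm": 0.0, "sbp": 118, "dbp": 76, "weight_kg": 58.0, "diabetic": 0},
    {"arm_circumference_cm": 25.5, "sbp": 0, "dbp": 0, "weight_kg": 49.0, "diabetic": 1},
    {"arm_circumference_cm": 29.0, "sbp": 135, "dbp": 88, "weight_kg": 72.3, "diabetic": 1},
]

cleaned = drop_zero_measurements(sample)
print(len(sample), "->", len(cleaned))
```

The reported counts are consistent with each zero value occurring in a different record: 2 + 3 + 2 + 2 = 9 records removed, and 1564 − 9 = 1555.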

Table 2 Description of attributes of the dataset

From Table 2, we observe that the attributes age, systolic blood pressure, diastolic blood pressure, height, weight, and arm circumference have numeric values, so they do not need to be encoded. The other features, however, had to be encoded because they hold categorical values such as strings or booleans. To find the differences between diabetic patients and controls, we conducted an independent two-sample t-test for continuous variables and a Pearson Chi-square test for categorical variables; the results are given in Table 3. The table shows that most of the patients were from Dhaka and that the proportion of rural participants was higher than that of urban participants. We also categorised age into five classes following [45]. Our dataset covers people aged 35 to 54 years. We removed the gender feature because the female class makes up only 0.26% of the data. We also removed the Electricity and Take medicine features because their data were unevenly distributed. Our final dataset therefore consists of 1555 records, 13 features and one target variable.

3.2 Model Specification

Various machine learning based works exist for identifying diabetes. In this study, however, our main concern is identifying the risk factors along with detecting diabetes. As our dependent variable is binary and discrete, we use a logistic regression model, which can be specified as follows:

$$\begin{aligned} P (Y=1|X)&= f\left( X_{1},X_{2},X_{3},\ldots ,X_{n}\right) \\&= \frac{1}{1+{e}^{-\left( \beta _0+\beta _1 X_1+\cdots +\beta _n X_n\right) } } \end{aligned}$$

This can also be expressed as:

$$\begin{aligned} Logit (p)&= \log \frac{p}{1-p} \\&= {\left( \beta _0+\beta _1 X_1+\cdots +\beta _n X_n\right) } \end{aligned}$$

where \(\beta _0\) is the intercept, \(X_i\) are the predictor variables, and \(\beta _i\) is the regression coefficient associated with the ith predictor. In the above, p is the probability of having diabetes and \(\frac{p}{1-p}\) is known as the odds. \(\beta _i\) gives an estimate of the change in the log-odds associated with a unit change in the ith predictor, so \(e^{\beta _i}\) is the corresponding odds ratio. The parameters of the model are estimated by maximum likelihood estimation (MLE).
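The link between the logistic model, the logit and the odds ratio can be made explicit with a short derivation. It uses only the symbols defined above; the shorthand \(\eta\) for the linear predictor is introduced here for convenience.

$$\begin{aligned} \eta&= \beta _0+\beta _1 X_1+\cdots +\beta _n X_n, \qquad p = P(Y=1|X) = \frac{1}{1+e^{-\eta }} \\ 1-p&= \frac{e^{-\eta }}{1+e^{-\eta }} \quad \Rightarrow \quad \frac{p}{1-p} = e^{\eta } \quad \Rightarrow \quad \log \frac{p}{1-p} = \eta \end{aligned}$$

Increasing \(X_i\) by one unit while holding the other predictors fixed changes \(\eta\) to \(\eta +\beta _i\), so the odds are multiplied by a constant factor:

$$\begin{aligned} \frac{\text {odds}(X_i+1)}{\text {odds}(X_i)} = \frac{e^{\eta +\beta _i}}{e^{\eta }} = e^{\beta _i} \end{aligned}$$

An odds ratio below one therefore corresponds to a negative coefficient, and an odds ratio above one to a positive coefficient. When the model contains a single factor, \(e^{\beta _i}\) is the unadjusted odds ratio; when all factors enter the model together, it is the odds ratio adjusted for the remaining factors.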

4 Result and Discussion

In the following subsections, we briefly present the outcomes identified during the quantitative phase of the study. We then determine the effect of the selected features on diabetes using logistic regression.

Table 3 Demographic and clinical characteristics of the study population

4.1 Result of Quantitative Analysis

The dataset contains both continuous and categorical data. The demographic characteristics of the participants contain only categorical data, while the clinical characteristics comprise both continuous and categorical data. Continuous variables are expressed as mean ± SD, and categorical variables are expressed as n(%). We calculated the p-values for the continuous and categorical variables with two different tests: an independent two-sample t-test and a Pearson Chi-square test, respectively. In this study, we set the level of significance at 5% (p < 0.05).
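The two tests can be sketched with the Python standard library alone. The sketch below is illustrative, not the implementation used in the study: the function names and toy values are invented, the t-test p-value uses a normal approximation that is reasonable only for large samples such as the 1555 participants here, and the Chi-square function covers only 2×2 tables (factors with more than two categories need more degrees of freedom and are better handled by a statistics package).

```python
import math
from statistics import NormalDist, mean, variance


def pooled_t_test(a, b):
    """Independent two-sample t-test with pooled variance.

    Returns (t, df, p). The two-sided p-value uses a normal approximation
    to the t distribution, which is reasonable only for large samples.
    """
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    p = 2 * (1 - NormalDist().cdf(abs(t)))
    return t, df, p


def chi_square_2x2(table):
    """Pearson Chi-square test (no continuity correction) for a 2x2 table.

    With one degree of freedom the p-value is exactly erfc(sqrt(chi2 / 2)).
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Toy values, invented for illustration (not BDHS data); the sample is far
# too small for the normal approximation and only demonstrates the call.
weight_diabetic = [62.0, 70.5, 66.0, 74.0, 68.5, 71.0]
weight_control = [55.0, 58.5, 61.0, 53.0, 57.5, 60.0]
t_stat, dof, p_t = pooled_t_test(weight_diabetic, weight_control)
print(f"t = {t_stat:.2f}, df = {dof}, approximate p = {p_t:.4g}")

# Rows: category A / category B; columns: diabetic / control.
toy_table = [[10, 20], [30, 40]]
chi2, p_chi = chi_square_2x2(toy_table)
print(f"chi2 = {chi2:.3f}, p = {p_chi:.3f}")
```

A factor is then called statistically significant when its p-value falls below the 5% threshold stated above.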

To find the differences between diabetic patients and controls, we report the t-test results for the continuous variables and the Pearson Chi-square test results for the categorical variables in Table 3. The table shows that the factors Residence, Wealth Index, Education, Working Status, Smoking Status, Arm Circumference, Weight and BMI group have p-values of less than 0.05. These factors are therefore statistically significant: they differ significantly between the diabetic group and the control group. In contrast, the factors Region, Age Group, SBP, DBP and Height are not statistically significant, as their p-values are greater than 0.05, which indicates no significant differences in these factors between the diabetic and control groups. To better understand the relationships among the factors, we computed the similarity score between them. Figure 2 presents the correlation matrix, which illustrates the associations between the factors. From this figure, we can readily identify the correlated factors that assist in predicting diabetes. Systolic and diastolic blood pressure are highly correlated with each other, as are weight and BMI group. Moreover, education and wealth index, as well as weight and arm circumference, are moderately correlated. Smoking status, however, is only weakly associated with both education and wealth index.
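How such a correlation matrix can be computed is sketched below. The text does not define the similarity score, so this sketch assumes it is the Pearson correlation coefficient; the function names, column names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Pairwise correlations for a dict mapping column name -> list of values."""
    names = list(columns)
    return {r: {c: pearson(columns[r], columns[c]) for c in names} for r in names}


# Toy columns, invented for illustration (not BDHS data).
toy_columns = {
    "sbp": [110, 120, 135, 128, 142],
    "dbp": [70, 78, 88, 82, 95],
    "weight": [52, 60, 58, 71, 66],
}

matrix = correlation_matrix(toy_columns)
for name, row in matrix.items():
    print(name, {other: round(value, 2) for other, value in row.items()})
```

Values near 1 or −1 indicate strongly related factors, and values near 0 indicate weak association. Categorical factors would need to be numerically encoded first, or compared with an association measure suited to categorical data.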

Fig. 2 Correlation between factors using similarity score

Applying logistic regression, we calculated the odds ratio (OR), adjusted odds ratio (AOR), p-value, and confidence interval for each factor, as shown in Table 4. These values are crucial for identifying the key risk factors for diabetes. We chose the low-frequency categories or negative classes as our reference groups. For example, for the first factor, “Region”, we chose Barisal as the reference group and determined the odds ratio and adjusted odds ratio relative to it. The table shows that the odds of being diabetic for a person from the Sylhet region are 1.1 times those of a person from the reference region Barisal, whereas for a person from the Khulna region they are 0.6 times those of the reference. Moreover, the odds for urban residents are 0.9 times those of the reference residence group, rural. Diabetes is also more likely in the richest category and less likely in the middle category compared with the reference category, “poorest”. The odds of having diabetes increase with age and with education. The odds of having diabetes for working people are 0.133 times those of non-working people, and the odds for smokers are 0.419 times those of non-smokers.
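For a single binary factor, the unadjusted odds ratio, its confidence interval and its p-value can be obtained directly from a 2×2 table, which is equivalent to a logistic regression with that factor as the only predictor. The sketch below assumes Wald-type intervals and uses invented counts and a hypothetical function name; it is not the study's code, and adjusted odds ratios require fitting the full multivariable model instead.

```python
import math
from statistics import NormalDist


def odds_ratio_2x2(exposed_cases, exposed_controls, ref_cases, ref_controls, level=0.95):
    """Unadjusted odds ratio with a Wald confidence interval and p-value.

    The exposed group is compared against the reference group; all four
    counts must be positive.
    """
    odds_ratio = (exposed_cases * ref_controls) / (exposed_controls * ref_cases)
    log_or = math.log(odds_ratio)
    # Standard error of the log odds ratio.
    se = math.sqrt(
        1 / exposed_cases + 1 / exposed_controls + 1 / ref_cases + 1 / ref_controls
    )
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    p_value = 2 * (1 - NormalDist().cdf(abs(log_or / se)))
    return odds_ratio, (lower, upper), p_value


# Toy counts, invented for illustration (not the values behind Table 4):
# 20 diabetic and 480 control participants in the exposed group,
# 50 diabetic and 450 control participants in the reference group.
or_value, (ci_low, ci_high), p_val = odds_ratio_2x2(20, 480, 50, 450)
print(f"OR = {or_value:.3f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f}), p = {p_val:.4f}")
```

An odds ratio below one means lower odds than the reference group. For instance, the reported value of 0.419 for smokers means their odds of diabetes are 0.419 times those of non-smokers, that is, about 58% lower (1 − 0.419 = 0.581). A confidence interval that excludes one corresponds to a p-value below 0.05.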

Table 4 Effect of selected factors on the diabetes using logistic regression

For the clinical characteristics, all the predictors are continuous except the BMI group. A normal-weight or overweight person is more likely to be diabetic, and an obese person is less likely to be diabetic, compared with the reference group, underweight. Since the adjusted odds ratios of arm circumference and weight are greater than one, an increase in these factors is associated with an increased probability of having diabetes. From Table 4, it can be concluded that “Working Status” and “Smoking Status” are statistically significant factors for diabetes at the 5% level of significance, because the p-values for both OR and AOR are smaller than 0.05. The remaining 11 factors are not significant for diabetes detection, as their p-values for the AOR are greater than 0.05. Finally, Figure 3 shows the importance of the factors as determined by the ANN.

Fig. 3 Importance of factors calculated by Artificial Neural Network (ANN)

4.2 Discussion

In this research, we have tried to determine the factors that cause diabetes. The factors in our dataset are divided into the demographic and clinical characteristics of the study population. From the experiments, we can suggest that a person’s working status and smoking status are the most critical factors associated with diabetes. However, arm circumference, weight and BMI also carry particular importance in this research. Overall, the characteristics mentioned above are essential for efficient diabetes detection, although working and smoking status are the strongest factors when judged by statistical significance in the regression.

Among the demographic characteristics, residence, wealth index, education, working status and smoking status are more important than the other factors. Factors such as region and age group have p-values greater than 0.05, which indicates that they are not significant for diabetes detection. The risk of diabetes is higher in people with higher education [46]. Moreover, a study in the UK found that people with lower wealth are more prone to diabetes [47]. These factors are thus associated with people’s lifestyle and can influence diabetes. The impact of the clinical characteristics is also substantial. Arm circumference, weight and BMI are significantly associated with diabetes. A previous study found that BMI has a strong association with diabetes and that obese individuals are more likely to have diabetes [48]. However, blood pressure and height have p-values greater than 0.05, so these factors do not differ significantly between the diabetic and control groups.

This research has several advantages over previous research. We have developed a statistical model to identify the factors causing diabetes efficiently at an early stage of the disease. The main objective of our paper was to analyse the diabetes dataset from a quantitative point of view, and our proposed statistical framework contributes to the statistical analysis of diabetes. Moreover, we have found several key factors related to diabetes. Wealthier people and non-working people are more likely to have diabetes. Another interesting finding is that people who smoke are less likely to have diabetes, which contradicts previous studies [49]. Our findings thus illustrate that different characteristics and factors are related to different frequencies of diabetes. Furthermore, our work is not confined to determining the association of the predictors with the diabetes class; we also examine the correlation among the factors by calculating the similarity score. In future, the factors must be chosen carefully when predicting diabetes.

This work also has a few shortcomings. The dataset does not provide any information about the family history of the diabetic person, which might be a critical factor in diabetes prediction. Moreover, we had to ignore the gender feature because the gender data are unevenly distributed. According to previous studies, men have a higher incidence of diabetes than women [50]. Gender could therefore be another key feature that we omitted because of its uneven distribution. Furthermore, we used only regression analysis to detect diabetes and test our hypotheses. This methodological limitation might be addressed by also applying classification-based analysis.

To conclude, both demographic and clinical characteristics should be analysed to explain the occurrence of diabetes. The clinical characteristics are the most significant predictors in describing the emergence of diabetes, and the demographic characteristics are likewise necessary for predicting it. Smoking status, working status, weight and BMI play a vital role in determining the disease accurately. Thus, non-smoking, non-working, and obese people have a high probability of having diabetes.

5 Conclusion

Diabetes is a widespread disease, and about 80% of people with diabetes live in low- and middle-income countries, as reported by the International Diabetes Federation, Belgium. It is essential to discover the causes behind diabetes and understand the pattern of the disease. In this paper, we studied the demographic and clinical characteristics of 1555 participants from Bangladesh and identified the significant factors that cause diabetes. Using logistic regression, we computed the odds ratio and the adjusted odds ratio and determined the relation between the factors and the outcome. We found that two characteristics, smoking status and working status, were significantly associated with the diabetes class, as the p-values for both factors were less than 0.05 in the OR and AOR calculations. Moreover, the odds ratio was 0.419 for smokers relative to non-smokers and 0.133 for working people relative to non-working people. Finally, this study presents the diabetes pattern and can assist healthcare professionals in prompt decision-making.