Background

Chronic diseases such as diabetes, obesity and cancer are caused by the complex interaction between environmental factors (such as diet, lifestyle, and the built environment) and genetic factors [1,2,3]. To understand the ultimate role of environmental, behavioral, and genetic factors along with their interactions, large-scale population cohorts have been established, mainly in Europe, North America, China, Japan, and Korea [4]. No such large population-based studies currently exist in the Gulf Region [5].

Two large biobank projects were launched, one in Saudi Arabia by the King Abdullah International Medical Research Center’s (KAIMRC) and the second in Qatar, by the Qatar Foundation and the Supreme Council of Health. The Qatar Biobank is a Qatar national population based prospective cohort study which includes the collection of biological samples, with long-term storage of data and samples for future research. The ultimate goal is to allow physicians and researchers to use the data collected from the biobank to conduct a large-scale study of the combined effects of genes, environment, and lifestyle on these diseases, to educate people on risk factors for these common diseases and to study disease incidence patterns and develop new diagnostic and therapeutic approaches. Using this pilot data, we had access to 60 features measured on 1000 Qatari citizens. The variables summarize physical, clinical and biochemical measurements such as age, gender, ethnicity, albumin, transaminase time, calcium, cholesterol, and uric acid.

The aim of this study is to use state-of-the-art statistical and machine learning methods to identify biomarkers for medical conditions; diabetes and obesity in this case, to identify the associated risk factors in Qatari population compared to those previously found in other studies. To the best of our knowledge, this is the first study that has been done on Qatari biobank few months after its release.

Methods

Ethical approval

The study was conducted according to the policies, regulations and guidelines for Research Involving Human of the Qatar Ministry of Public Health. All procedures involving human subjects were approved by the Institutional Review Board of Hamad Medical Corporation in Doha, Qatar. Written informed consent was obtained from all participants prior to their enrollment in the study.

Study population

The Qatar Biobank project is a population based cohort, aiming to prospectively examine 60,000 Qataris and long term residents (≥ 15 years living in Qatar) aged 18 years or more. Details are available in [6]. Briefly, potential participants were contacted via word of mouth or via Qatar Biobank’s website www.qatarbiobank.org.qa. Consented participants visited Qatar Biobank facility at Hamad Medical City Building 17, Doha, Qatar, where they underwent a 5-stage interview, physical and clinic measurement sequence, with an average duration of 3 h. Extensive questionnaires (i.e. health behaviors, medical history, lifestyle characteristics, physical activity, mental health, environmental exposures etc.) and clinical examination (i.e. anthropometric measurements, blood pressure, electrocardiogram, bone density etc.) were administered by trained research personnel at enrollment. Participants were asked to provide biological samples (blood, urine and saliva). Biological samples were sent for analysis at the diagnostic laboratories at Hamad Medical Corporation, Doha, Qatar. All lab equipment was calibrated to ensure precision of results. The measured features comprise of routinely measured clinical biomarkers, for details see [6]. Qatar Biobank is recruiting more participants after completion of the pilot study to be as representative as possible of the eligible Qatari population, with a target of 60,000 study participants [6].

Out of the participants, data of 1305 randomly selected participants was used for the present pilot project. The participants consisted of 661 males (50.65%) and 644 females (49.35%), of which 99% were Qataris and remaining 1% were non-Qatari long term residents. The variables having more than 50% missing values and subjects having more than 9 missing values were removed. The dataset was used for two studies: diabetes and obesity. We denote the samples as dataset \({\mathbf{D}_{\mathbf{t2d}}}\) for diabetes analysis. The samples were divided into two groups: cases (n = 312 subjects having HbA1C% \(\ge\)6.5) and controls (n = 898 subjects having HbA1C% < 6.5). For obesity analysis, the dataset \({\mathbf{D}_{\mathbf{obs}}}\) was divided into two groups: cases (n = 508 subjects with BMI ≥ 25 kg/m2) and controls (n = 224 subjects with 18 ≤ BMI < 25 kg/m2).

Missing value imputation

We identified that 2.81% values of the diabetes dataset and 2.64% values of the obesity dataset were missing. Instead of removing the missing values we decided to approximate missing values using the well-known technique multivariate imputation by chained equations (MICE) implemented in the R package mice [7].

Baseline statistics

The baseline statistics for the two groups of samples were computed using R [8]. First, normality of the variables was tested using Anderson–Darling test in nortest package of R [9]. For a normally distributed variable in both groups, Student’s t-test was used to determine significance of difference in the group means. In this case, the group variance of the variable was calculated using F test. For remaining variables, Mann–Whitney test was used to determine significance of difference in the group means. A reported P value lower than 0.05 indicates the corresponding variable is statistically different in the groups.

Regularization models

In this paper, we have used the elastic net, the glinternet, the lasso projection and hdi methods for linear regression models.

The elastic net

The elastic net is a lasso based statistical method that combines L2 penalty with L1 penalty [10]. The elastic net is a better method compared to lasso as the lasso selects only one variable (randomly) out of a group of variables having high pairwise correlation. We used R package glmnet [11] for computation of coefficients with 10-fold cross validation for training the elastic net model.

One of the drawbacks of the elastic net is that it does not calculate statistical significance of the variables (P values), which motivated us to use methods other than the elastic net as well.

Glinternet

The glinternet is a group-lasso based method developed by Lim and Hastie [12]. The method learns pairwise interactions of variables in linear regression models satisfying strong hierarchy. An interesting feature of this method is its ability to incorporate both continuous and categorical variables at the same time in the model making it a unique method to analyze mixed data. We used R package glinternet [13] for computation of coefficients with tenfold cross validation for training the glinternet interaction model.

The lasso projection

The lasso projection (lasso proj) or de-sparsified lasso is a regularization based method that performs statistical inference of low dimensional parameters with high dimensional data [14]. The method uses low dimension projection approach to construct confidence intervals for the estimated regression parameters. Bühlmann and van de Geer improved the de-sparsified lasso by incorporating misspecifications in linear regression models [15]. We used R package hdi [16] for P value calculations for the lasso projection method.

High-dimensional inference

In case of high-dimensional data \(p>n\), standard covariance tests cannot be used without an estimate of the error standard deviation (\(\Sigma ^2\)). Meinshausen et al. introduced a method for computation of P values and confidence intervals in high-dimensional data [17]. In their approach, the data is split into two groups. Variables are selected in one group using the lasso regularization (the elastic net with tenfold cross validation). The selected variables are then used as predictors in an ordinary least squared regression on the other group to obtain associated P values. We used R package hdi [16] for P value calculation.

Machine learning models

In this section, we briefly summarize the modelling techniques used to generate predictive models and unsupervised clustering methods for the datasets \({\mathbf{D}_{\mathbf{t2d}}}\) and \({\mathbf{D}_{\mathbf{obs}}}\). Our goal is to identify variables, which helps to differentiate cases from controls in the two datasets. For this purpose we used two predictive modelling techniques namely random-forests and gradient boosting machines (GBM), which can capture non-linear interactions and produce models which are interpretable. These models not only provide the importance of each variable w.r.t. the phenotype but also classify unseen samples to cases and controls. We have reported the importance of variables in the predictive models computed by R package caret [18]. The importance of variables was ranked and scaled to a maximum importance of 100 for comparison between different methods. The details of machine learning methods is available in Additional file 1.

Random forests

Random forest belongs to the class of ensemble based supervised learning techniques [19]. Random forest algorithm applies the general technique of bagging or bootstrapped aggregating [20] to decision tree learners. By performing this bootstrapping procedure, we obtain better model performance as it decreases the variance of the model, without increasing bias. This means that though each tree is a weak learner and sensitive to noise within its respective data, the average/majority of many trees is not, as long as the trees are not correlated. Thus, this bootstrap sampling is used to de-correlate the trees by showing them different parts of the dataset. Random forests automatically rank the importance of variables in a classification problem by considering the average Information Gain [19] corresponding to each variable for all the trees. We used R package caret [18] to generate random forest models.

Gradient boosting machine

We used gradient boosting machine another ensemble technique for building a predictive model [21,22,23]. The principle idea behind this algorithm is to construct the new base-learners to be maximally correlated with the negative gradient of the loss function, associated with the whole ensemble. We used R package caret [18] for building a GBM predictive model. Detailed description of the method is provided in [22] and Additional file 1.

Unsupervised learning

We used principal component analysis to perform exploratory analysis to identify variables that contribute to the maximum variance in the data. Such variables can be used as potential biomarkers to classify a new sample as case or control. We have used pca biplots [24] to provide visualization of the variables along with the samples. We used R package stats for building pca biplots [24]. We performed principal component analysis (PCA) using top ten discriminative variables from machine learning methods mentioned above. The plots represent contribution of each variable in the PCs in form of labeled vectors. The angle between two vectors indicates the correlation of the variables. In these plots the colored ellipses represent the density of the two classes.

Survival and risk analysis

Survival analysis

We have applied survival analysis on the prognosis of diabetes in the Qatari population. Survival analysis [25, 26] examines and models the time it takes for events to occur, diabetes in our case. Survival analysis focuses on the distribution of event times. In our analysis, we used it to estimate the distribution of time of diabetes development. The time in the model is considered with reference to the time of birth as shown in Fig. 1. For controls, since diabetes is not developed to the current age, the time is considered to be equal to the current age TC and the data is considered to be right censored as the future time of diabetes development is not known. For cases, the time is considered to be equal to the time of event TD, which is the diagnosis of diabetes. We have used the Kaplan–Meier estimator [27] implemented in the R package survival [28] to estimate the distribution of time of diabetes development.

Fig. 1
figure 1

Prognosis of diabetes in controls and cases

Risk analysis

We have also analyzed event times using Cox proportional hazard model [29], a regression based model, in our study. The model assumes covariates to be linear in the log space. Moreover, the model assumes exponential hazard distribution [30] or constant hazard function i.e. the survival function changes proportionally with each variable. We have performed cox proportional hazard regression analysis for each of the predictor variable independent of the other and also in a multivariate regression. We have used the R package survival [28] for cox proportional hazard regression analysis.

Results

We have applied the aforementioned methods on the study population considering all the participants. We have also performed gender stratified analysis to investigate the impact of gender (see Additional file 2 for details).

Baseline characteristics of the study population

Based on the baseline statistics, age was found very significantly associated with diabetes and obesity. Therefore, age was removed from the dataset and phenotype was age adjusted for rest of the analysis. The baseline characteristics of ten most significant variables differentiating the study population for diabetes and obesity are listed in Table 1. Complete list of baseline characteristics is available in Additional file 3. Triglycerides, BMI, and vitamin D were significantly higher (P values \(2.03\times 10^{-11}\), \(8.00\times 10^{-09}\), and \(1.93\times 10^{-08}\) respectively) whereas chloride, magnesium, albumin, free triiodothyronine, sodium and high density lipoprotein were significantly lower (P values \(4.51\times 10^{-24}\), \(3.50\times 10^{-23}\), \(1.07\times 10^{-10}\), \(1.50\times 10^{-08}\), \(2.17\times 10^{-08},\) and \(5.25\times 10^{-08}\) respectively) in cases compared to controls in the diabetes dataset. Similarly, c-peptide of insulin, triglycerides, HBA1C%, insulin, and uric acid were significantly higher (P-values \(1.95\times 10^{-28}\), \(6.94\times 10^{-25}\), \(1.43\times 10^{-20}\), \(5.19\times 10^{-15}\), \(6.87\times 10^{-13}\), \(1.54\times 10^{-10}\), and \(4.25\times 10^{-08}\) respectively) whereas albumin, high density lipoprotein, magnesium, and total bilirubin were significantly lower (P values \(3.24\times 10^{-10}\), \(3.61\times 10^{-08}\) , and \(7.18\times 10^{-08}\) respectively) in cases compared to controls in the obesity dataset.

Table 1 Baseline characteristics for diabetes and obesity study

Regularization models

Results of the elastic net, the glinternet, the lasso proj and hdi are listed in Table 2 for diabetes and obesity studies. Coefficients (\(\beta\)) are reported for the elastic net and glinternet whereas P values are reported for the lasso proj and hdi. A positive coefficient indicates correlation whereas a negative coefficient indicates inverse correlation of the variable with the phenotype.

Table 2 Significant results of elastic net, glinternet, lasso proj and hdi

We identified magnesium, calcium, high density lipoprotein (HDL-C), phosphorus, chloride, free triiodothyronine, albumin, insulin, and uric acid significant in diabetic subjects using the elastic net and glinternet. We identified magnesium, high density lipoprotein (HDL-C), chloride, free triiodothyronine, insulin, and uric acid (P values \(3.35\times 10^{-10}\), \(3.73\times 10^{-03}\), \(2.99\times 10^{-09}\), \(2.58\times 10^{-03}\), \(1.88\times 10^{-04}\), and \(1.31\times 10^{-05}\) respectively) as significant variables using the lasso proj. We identified magnesium, high density lipoprotein (HDL-C), chloride, insulin, and uric acid (P values \(2.34\times 10^{-09}\), \(6.96\times 10^{-04}\), \(7.43\times 10^{-11}\), \(9.36\times 10^{-02}\), and \(4.05\times 10^{-04}\) respectively) as significant variables using hdi.

Similarly, we identified magnesium, high density lipoprotein, albumin, calcium, c-peptide of insulin, cholesterol, total bilirubin, vitamin D, triglycerides, uric acid, and vitamin B12 significant in obese subjects using the elastic net and glinternet. We identified high density lipoprotein, albumin, cholesterol, vitamin D, uric acid, and vitamin B (P values \(7.46\times 10^{-03}\), \(1.11\times 10^{-05}\), \(1.03\times 10^{-03}\), \(1.22\times 10^{-07}\), and \(1.64\times 10^{-02}\) respectively) as significant variables using the lasso proj. We identified albumin and uric acid (P values \(2.40\times 10^{-09}\) and \(1.52\times 10^{-03}\) respectively) as significant variables using hdi.

Machine learning models

Results of machine learning models are summarized in Fig. 2. For diabetes study, both random forest and GBM have identified magnesium, chloride, c-peptide of insulin, insulin, and uric acid as important variables for predicting diabetes. Similarly, insulin, c-peptide of insulin, albumin, uric acid, and vitamin D were identified as main variables for predicting obesity.

Fig. 2
figure 2

Relative variable importance of top variables of machine learning methods for (a) diabetes and (b) obesity

The PCA biplots of first two principal components (PCs) are shown in Fig. 3. The plots indicate that there are overlapping clusters of cases and controls detected by the first two principal components, which is expected especially in case of diabetes indicating presence of pre-diabetic subjects. For diabetes study, there is a high correlation between magnesium and chloride; free triiodothyronine and LDLC; and c-peptide of insulin and insulin (Fig. 3a). Similarly, for the obesity study there is a high correlation between c-peptide of insulin and insulin; total bilirubin and albumin; and hemoglobin, serum creatinine and uric acid (Fig. 3b).

Fig. 3
figure 3

PCA Biplots for (a) diabetes and (b) obesity studies

Survival and risk analysis

Survival analysis

Figure 4a shows the probability of being non-diabetic (y-axis) in Qatari population at a given age (x-axis). In the plot, the solid line indicates the probability of being non-diabetic (solid line) along with the \(95\%\) confidence intervals (dotted lines). Variation in the probability increases with age due to a large number of uncensored observations thus widening the \(95\%\) confidence interval associated with the probability. The analysis reveals that at the age of 40, there are \(15\%\) chances of developing diabetes in Qatari population and the chances increase to \(50\%\) at the age of 63. We have also analyzed the data by stratifying on the basis of gender. Figure 4b shows the probability of being non-diabetic (y-axis) in Qatari population at a given age (x-axis) for males and females. The results indicate that females are slightly at more risk to diabetes than males before the age of 40 but later on males have more chances to develop diabetes.

Fig. 4
figure 4

Probability of being non-diabetic (y-axis) in Qatari population (a) at a given age (x-axis), and (b) stratified on gender

Risk analysis

We have performed cox proportional hazard regression analysis for each of the predictor variable independent of the other. The results are summarized in Table 3. Here lower p-values, high magnitude of \(\beta\), and high value of Wald test means a variable is playing an important role in the risk of disease. In this case, variables such as calcium, magnesium, hemoglobin, triglycerides, and free-triiodothrymine play a very significant role in determining risk of the disease. The proportionality assumption of each variable must be validated in the model for correct modeling of the data. We have used scaled Schoenfeld Residuals test [31] to check proportionality assumption of each variable. Results of the test are summarized in Additional file 4. Only triglycirides variable violates the proportionality assumption as its p-value is less than the 0.05 threshold. We have investigated the impact of gender and magnesium on the survival as shown in Fig. 5a, b. We have also performed the multivariate cox regression on all the variables together in a multivariate regression setting. The results are shown in Fig. 5c.

Fig. 5
figure 5

Impact on survival probability of (a) gender, (b) magnesium, and (c) all variables

Table 3 Multivariate Cox regression results for diabetes

Discussion

A majority of adults in Qatar are obese or overweight, which is a main risk factor for developing diabetes and between 18.5 and 20% population have been diagnosed with diabetes, according to Qatar Diabetes Association of Qatar Foundation. Both conditions—which are related to each other as well as to heart disease-increased significantly in just 6 years, with the prevalence of diabetes alone jumping nearly \(20\%\) between 2012 and 2016. Although there are a number of factors associated with diabetes and obesity, ranging from genetics to individual behaviors, the metabolomics and other factors have been increasingly implicated in these epidemics. Our study is based on a new data from the 2015 to 2016 Biobank Health Interview Survey, the nation’s largest health survey.

The study proposes use of state of the art statistical and machine learning methods to identify biomarkers for medical conditions; diabetes and obesity in this case. The statistical methods rely on lasso and group-lasso based techniques that can even use mixed continuous and categorical variables. The machine learning methods rely on tree based models that provide importance of variables in predictions. In contrast to relying solely on the widely used baseline statistics, which perform marginal analysis considering a single variable at a time, these methods are based on multivariate analysis of the medical conditions. Moreover, we recommend using an ensemble of methods complementing their findings. This is because some variables are either identified by only some methods such as calcium, phosphorus, triglycerides (as shown in Table 2), or variable significance could vary between the methods such as magnesium, chloride, insulin (as shown in Table 2 and Fig. 2). From gender stratified analysis, we found that some variables have higher significance in gender specific groups compared to the whole dataset. In diabetes study, uric acid has high significance in males and triglycerides have high significance in females. Similarly in obesity study, insulin has high significance in males and HBA1C% has high significance in females.

According to world health organization, drinking water accounts for \(29{-}38\%\) of the estimated average requirement of magnesium [32]. Nriagu et al. have found association of low mineral desalinated water with cancer [33]. Their findings of low magnesium water in \(99\%\) portable water supply can be one of the contributing factors in hypomagnesia shown in both cases and controls. Recently, Gommers et al. have also found hypomagnesia to be one of the causes of type 2 diabetes [34].

Although hypomagnesemia have been reported low in diabetes, to the best of our knowledge chloride is not reported low in diabetic subjects. Low levels of magnesium and chloride may be an indicator of renal impairment [35]. Moreover, our study has revealed interactions of hypomagnesemia with HDL-C, triglycerides, and free thyroxine. These findings need further investigations. In next study, we will have available genomics and proteomics data and we intend to use a more advanced integrative analysis tools to associate these two diseases with genetics and other factors.

Conclusion

Our study strongly confirms known associations and risk factors associated with diabetes and obesity in Qatari population as previously found in other population studies. For diabetes, biomarkers in Qatari population (as identified by different methods) include magnesium, calcium, HDL-C, chloride, insulin, c-peptide of insulin which have been previously reported by [36,37,38,39,40] to list a few. Similarly, for obesity, significant biomarkers (as identified by different methods) include insulin, c-peptide of insulin, albumin, and uric acid which have been previously reported by [41,42,43,44].