Introduction

The incidence of type 1 diabetes is increasing, particularly in children [1]. Much of the aetiology of type 1 diabetes is accounted for by genetic predisposition [2, 3] and, in particular, by genes within the HLA class II region. HLA class II genotypes are used to select neonates for recruitment into natural history studies and primary prevention trials [3] and, together with islet autoantibody status, are used for recruiting children into secondary prevention trials [3]. However, screening is limited by the low specificity of the genetic screen when applied to the general population, or by its low sensitivity when screening is confined to children with a family history of type 1 diabetes.

Besides the HLA class II gene region, more than 40 regions of the human genome confer susceptibility to type 1 diabetes [4, 5]. The additional contribution of any single non-HLA region to risk stratification is small [5], but simple combination of multiple genes has been shown to aid the stratification of type 1 diabetes risk [6]. We reasoned that improvement in prediction might be achieved with an expanded susceptibility gene set and by weighting gene contributions. A previous attempt to combine the genes in weighted logistic regression models suggested that expectations for combination approaches should be modest [7]. Advanced machine learning models that include model selection and feature ranking have recently been used to improve genetic prediction in other diseases [8–10]. Similar approaches have yet to be used for type 1 diabetes.

In this study, we applied multivariable logistic regression and Bayesian feature selection to 41 genetic susceptibility markers using data from the Type 1 Diabetes Genetics Consortium (T1DGC) comprising over 4,500 cases and over 1,000 controls [11]. We used the T1DGC dataset to train our models and identify weighted single-nucleotide polymorphism (SNP) combinations affecting the development of type 1 diabetes. We quantified how well the models could generalise to unseen datasets by testing their performance on an independent validation set, then assessed their predictive power for screening in families and performed simulated risk projections for the general population.

Methods

Study population

Data from 4,574 people with type 1 diabetes and from 1,207 non-related control persons from the T1DGC dataset were used for analysis [11]. Results were validated in a second set from Germany [12–14].

T1DGC set

The T1DGC study protocol has been described in detail previously [11]. For the present analysis, we used data from the T1DGC.2011.03 Taqman dataset consisting of individuals from multiple populations. Only people with European ancestry were included in the analyses. The mean age of diabetes onset was 7.9 years (SD 3.9, Table 1). Control persons had no family history of type 1 diabetes [11].

Table 1 Characteristics of the study sets

German validation set

The German validation set consisted of parents from the BABYDIAB study, including 437 individuals with type 1 diabetes and 423 non-related spouses as controls, and 328 children and adolescents with newly diagnosed type 1 diabetes from the DiMelli Bavarian diabetes register [12–14]. The mean age at diabetes onset was 14.2 years (SD 7.6, Table 1).

BABYDIAB/BABYDIET cohort

The BABYDIAB and BABYDIET studies prospectively follow infants for islet autoimmunity and type 1 diabetes [14, 15]. Between 1989 and 2000, BABYDIAB recruited 1,650 offspring of patients from Germany who had type 1 diabetes [14]. Between 2000 and 2006, 792 offspring or siblings of patients from Germany who had type 1 diabetes were enrolled in the BABYDIET study. Islet autoantibodies were measured in samples taken at visits at age 9 months and 2 years and every 3 years thereafter, and every 6 months in children who had tested positive at least once for any of the islet autoantibodies. A subgroup of 150 children participated in the BABYDIET gluten intervention study and had 3-monthly follow-up visits from age 3 months to 3 years, and yearly visits thereafter [15]. The studies were approved by the ethical committees of Bavaria, Germany (Bayerische Landesärztekammer No. 95357) and the Ludwig Maximilian University (No. 329/00). Informed, written consent was obtained from all parents. The studies were carried out in accordance with the Declaration of Helsinki, as revised in 2000.

Genotyping

Typing for HLA class II alleles at HLA-DRB1, HLA-DQA1 and HLA-DQB1, performed according to the T1DGC protocol with a sequence-specific oligonucleotide-based linear assay [16], was available for 1,814 individuals from the T1DGC set. For the remainder of the T1DGC set, the SNPs rs2187668 and rs7454108 were used to tag the DR3-DQA1*05:01-DQB1*02:01 (DR3-DQ2) and DR4-DQA1*03:01-DQB1*03:02 (DR4-DQ8) haplotypes. HLA class II alleles at HLA-DRB1, HLA-DQA1 and HLA-DQB1 within the validation set were determined using PCR-amplified DNA and non-radioactive sequence-specific oligonucleotide probes [11]. Genotyping of the 40 non-HLA SNPs (electronic supplementary material [ESM] Table 1) within the T1DGC set was performed in the Taqman Laboratory, Cambridge, UK using the TaqMan 5' nuclease assay (Applied Biosystems, Warrington, UK). Genotyping of the 40 non-HLA SNPs within the validation set was performed using TaqMan Open Array SNP Genotyping (Applied Biosystems).

HLA risk genotypes were categorised as 6 = DR3/DR4-DQ8; 5 = DR4-DQ8/DR4-DQ8; 4 = DR3/DR3; 3 = DR4-DQ8/x; 2 = DR3/DRx; 1 = DRx/DRx (where x represents non-DR3, non-DR4-DQ8 alleles). For each non-HLA SNP, a score of 2 was assigned to individuals homozygous for the susceptibility allele, 1 to heterozygous individuals and 0 to individuals homozygous for the non-susceptibility allele.
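As an illustration of this coding scheme, the sketch below (in R, the environment used for the analyses) derives the six HLA categories and the five indicator variables from haplotype counts; the data frame `geno` and its column names are hypothetical and not part of the original pipeline.

```r
## Minimal sketch of the genotype scoring, assuming a data.frame `geno` with
## columns `dr3` and `dr4dq8` (number of copies of each HLA haplotype, 0-2)
## and one column per non-HLA SNP holding the susceptibility allele count.

hla_category <- function(dr3, dr4dq8) {
  if (dr3 >= 1 && dr4dq8 >= 1) return(6)  # DR3/DR4-DQ8
  if (dr4dq8 == 2)             return(5)  # DR4-DQ8/DR4-DQ8
  if (dr3 == 2)                return(4)  # DR3/DR3
  if (dr4dq8 == 1)             return(3)  # DR4-DQ8/x
  if (dr3 == 1)                return(2)  # DR3/DRx
  1                                       # DRx/DRx
}

geno$hla_cat <- mapply(hla_category, geno$dr3, geno$dr4dq8)

# expand the six HLA categories into five 1/0 indicator variables
# (category 1, DRx/DRx, is the implicit reference class)
for (k in 6:2) {
  geno[[paste0("hla", k)]] <- as.integer(geno$hla_cat == k)
}
```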

Statistical analyses

A multivariable logistic regression with SNPs as independent variables and type 1 diabetes as the dependent variable was performed. Log odds ratios βi were derived from the regression model

$$ \mathrm{logit}(p)=\log_{\mathrm{e}}\left(\frac{p}{1-p}\right)=\beta_0+\beta_1 s_1+\beta_2 s_2+\dots+\beta_n s_n $$

with \( p=P(D=1\mid s_1,\dots,s_n) \) the probability of developing diabetes, \( \beta_0 \) the intercept (baseline diabetes risk), \( s_i \) the state of SNP \( i \) (0, 1 or 2), \( \beta_i \) the log odds ratio of SNP \( i \) and \( n \) the number of SNPs. The risk score \( p \) corresponds to each individual's risk of developing diabetes according to the model. The log odds ratios can be regarded as weights (i.e. the higher the log odds, the more the SNP contributes to the risk score used for diabetes prediction). HLA was categorised into five variables (DR3/DR4-DQ8, DR4-DQ8/DR4-DQ8, DR3/DR3, DR4-DQ8/x, DR3/DRx), according to the six categories described above, each variable being a 1/0 indicator of whether an individual belongs to that class; the sixth class (DRx/DRx) is implicitly accounted for when all five HLA indicators are zero. The multivariable logistic regression provides the contribution of each SNP to the total model and in this way differs from analyses of individual SNPs. Regression analysis was performed using the ‘glm’ function implemented in the R computing environment 3.0.2 (http://r-project.org).
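A minimal sketch of this regression in R is shown below; the data frame `train`, its 0/1 outcome column `t1d` and the predictor columns (five HLA indicators plus the 0/1/2 SNP scores) are illustrative assumptions rather than the original code.

```r
# Fit the weighted multivariable model: the coefficients are the log odds
# ratios (weights) and the fitted probabilities are the per-person risk scores p
fit <- glm(t1d ~ ., data = train, family = binomial)

summary(fit)$coefficients                      # log odds ratios, SEs, p values
risk_score <- predict(fit, type = "response")  # risk score p for each person
```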

To test for interaction effects, two complementary approaches were used. First, second-order interaction terms between all pairs of SNPs were introduced, resulting in the extended regression model

$$ \mathrm{logit}(p)=\beta_0+\beta_1 s_1+\dots+\beta_n s_n+\beta_{12} s_1 s_2+\beta_{13} s_1 s_3+\dots+\beta_{1n} s_1 s_n+\beta_{23} s_2 s_3+\dots $$

Since this model contains too many parameters for the size of the training dataset, second-order interaction terms, \( \beta_{ij} \), were selected using forward model selection [17]. Second, support vector machines (SVM) with radial basis function (RBF) kernels and a Random Forest classifier [18, 19] were used, as implemented in the R CRAN packages ‘e1071’ and ‘randomForest’, respectively (see also ESM Methods 1 and 2). All 41 features were provided to the classifiers, with type 1 diabetes status as the outcome to be learned. Both classifiers can capture non-linearity and thus inherently account for interaction effects. Model quality was assessed using receiver operating characteristic (ROC) analysis [20]: all possible values of the risk score p were considered as thresholds to compute sensitivity and specificity. The ROC AUC was computed (1) for the training dataset, (2) using tenfold cross-validation and (3) for the validation set. For cross-validation [21], the dataset was subdivided into ten fixed stratified folds (i.e. each fold contained the same ratio of cases to controls as the original dataset) and the average AUC over the ten folds was computed.
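The sketch below illustrates these steps with the CRAN packages named above (‘e1071’, ‘randomForest’) plus ‘pROC’ for the AUC; the data frame `train` and outcome column `t1d` are the same illustrative assumptions as before.

```r
library(e1071)          # svm() with radial basis function kernel
library(randomForest)   # randomForest() classifier
library(pROC)           # auc()

svm_fit <- svm(factor(t1d) ~ ., data = train, kernel = "radial",
               probability = TRUE)
rf_fit  <- randomForest(factor(t1d) ~ ., data = train)

# tenfold stratified cross-validation of the logistic regression AUC:
# fold labels are randomised separately within cases and controls
set.seed(1)
folds <- ave(seq_len(nrow(train)), train$t1d,
             FUN = function(i) sample(rep(1:10, length.out = length(i))))
cv_auc <- sapply(1:10, function(k) {
  fit <- glm(t1d ~ ., data = train[folds != k, ], family = binomial)
  p   <- predict(fit, newdata = train[folds == k, ], type = "response")
  as.numeric(auc(train$t1d[folds == k], p))
})
mean(cv_auc)   # average AUC over the ten folds
```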

The increase in predictive power gained by adding the minor susceptibility SNPs was quantified using the integrated discrimination index (IDI) according to Pencina et al [22]. The IDI is the difference between the increase in the model's average sensitivity and the decrease in its average specificity. Model calibration was assessed using calibration plots as implemented in the ‘predictABEL’ R package.
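Written out, the IDI is a simple difference of mean predicted risks; the sketch below computes it directly from Pencina's definition rather than through a package, with `p_old`, `p_new` and `y` as assumed vectors of HLA-only risks, HLA plus 40 SNP risks and 0/1 outcomes.

```r
# IDI = (gain in mean predicted risk among cases) - (change among controls)
idi <- (mean(p_new[y == 1]) - mean(p_old[y == 1])) -
       (mean(p_new[y == 0]) - mean(p_old[y == 0]))
```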

Cumulative risks of multiple islet autoantibodies and/or type 1 diabetes development were estimated by Kaplan–Meier analysis, and p values were calculated with the logrank test. Follow-up was calculated from birth to the age at which multiple islet autoantibodies developed, the age at diagnosis of type 1 diabetes or the age at last contact.
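A sketch of these survival analyses using the ‘survival’ R package is given below; `followup` (years), `event` (0/1 for multiple islet autoantibodies and/or diabetes) and `risk_group` are assumed column names in a hypothetical cohort data frame.

```r
library(survival)

km <- survfit(Surv(followup, event) ~ risk_group, data = cohort)
summary(km, times = 5)    # 5 year cumulative risk = 1 - the survival estimate
survdiff(Surv(followup, event) ~ risk_group, data = cohort)   # logrank test
```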

Model selection and feature ranking

A Bayesian model selection algorithm was used to explore the model space spanned by all possible combinations of SNPs [9, 10]. Since the model space is prohibitively large (around 10¹² potential models), efficient sampling based on reversible-jump Markov chain Monte Carlo (rjMCMC) was used [9], an approach related to Bayesian penalised regression models [23]. The algorithm handles trans-dimensional models by randomly selecting a variable and proposing either its addition to or deletion from the current model, yielding a posterior probability for each model to be the best model (ESM Methods 3).
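The sketch below conveys the idea of this trans-dimensional search. It is not the authors' rjMCMC implementation: it uses a simple Metropolis sampler over SNP subsets with a BIC approximation to the model posterior (equal model priors, burn-in omitted for brevity), and the data frame `dat` with outcome column `t1d` is an assumption.

```r
set.seed(1)
snps   <- setdiff(names(dat), "t1d")
n_iter <- 5000
included <- setNames(rep(FALSE, length(snps)), snps)

model_bic <- function(inc) {
  rhs <- if (any(inc)) paste(snps[inc], collapse = " + ") else "1"
  BIC(glm(as.formula(paste("t1d ~", rhs)), data = dat, family = binomial))
}

current_bic <- model_bic(included)
visits <- matrix(FALSE, nrow = n_iter, ncol = length(snps),
                 dimnames = list(NULL, snps))

for (it in seq_len(n_iter)) {
  proposal <- included
  j <- sample(length(snps), 1)   # pick one variable at random ...
  proposal[j] <- !proposal[j]    # ... and propose adding or deleting it
  prop_bic <- model_bic(proposal)
  # exp((BIC_current - BIC_proposed)/2) approximates the posterior odds
  if (log(runif(1)) < (current_bic - prop_bic) / 2) {
    included    <- proposal
    current_bic <- prop_bic
  }
  visits[it, ] <- included
}

marginal <- colMeans(visits)          # marginal inclusion probability per SNP
sort(marginal, decreasing = TRUE)     # feature ranking
```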

Based on the rjMCMC results, marginal inclusion probabilities were computed for each SNP and used to generate a feature ranking. An alternative feature ranking was generated based on the log odds ratios from the full multivariable logistic regression. For comparison, we also generated 500 random rankings, in which the SNPs were placed in randomised order rather than being ranked by a statistical approach. For each ranking, the predictors were entered cumulatively into a multivariable logistic regression model and the predictive power of each model was assessed using ROC analysis with tenfold cross-validation.
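A sketch of this ranking evaluation is shown below; it reuses the marginal inclusion probabilities from the previous sketch and wraps the tenfold cross-validation in a helper, all under the same illustrative data assumptions (`dat`, outcome `t1d`, ‘pROC’ for the AUC).

```r
cv_auc_for <- function(dat, predictors) {
  # tenfold stratified cross-validated AUC of a logistic regression
  folds <- ave(seq_len(nrow(dat)), dat$t1d,
               FUN = function(i) sample(rep(1:10, length.out = length(i))))
  f <- as.formula(paste("t1d ~", paste(predictors, collapse = " + ")))
  mean(sapply(1:10, function(k) {
    fit <- glm(f, data = dat[folds != k, ], family = binomial)
    p   <- predict(fit, newdata = dat[folds == k, ], type = "response")
    as.numeric(pROC::auc(dat$t1d[folds == k], p))
  }))
}

ranked <- names(sort(marginal, decreasing = TRUE))   # rjMCMC-based ranking
auc_by_size <- sapply(seq_along(ranked),
                      function(m) cv_auc_for(dat, ranked[1:m]))
plot(seq_along(ranked), auc_by_size, type = "b",
     xlab = "Number of ranked markers in model", ylab = "Cross-validated AUC")
```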

Results

Prediction of type 1 diabetes using HLA class II genotypes and minor susceptibility genes

Building a multivariable logistic regression model that included HLA risk stratification into six categories, without additional susceptibility SNPs, yielded a ROC AUC of 0.82 (95% CI 0.80, 0.83) in the T1DGC set, a tenfold cross-validation AUC of 0.81 (95% CI 0.79, 0.82) and an AUC of 0.78 (95% CI 0.75, 0.80) in the validation set (Table 2, Fig. 1a). Higher discrimination was achieved when genotypes for the 40 minor susceptibility SNPs were added to the HLA risk model, with an AUC of 0.87 (95% CI 0.86, 0.88) in the T1DGC set, an AUC of 0.87 (95% CI 0.85, 0.88) in the tenfold cross-validation and an AUC of 0.84 (95% CI 0.81, 0.86) in the validation set (Table 2, Fig. 1a). The IDI for the increase in prediction accuracy from the HLA-only model to the model including all SNPs was 0.0986 (p = 3.1 × 10⁻²⁵). All models showed good calibration properties (ESM Methods 4). Log odds ratios for each SNP in the multivariable logistic regression model are displayed in Fig. 1b, and the genetic risk score distributions in the patient and control sets are visualised in Fig. 2.

Table 2 AUC values from the ROC analysis for the prediction of type 1 diabetes based on genetic markers in the T1DGC set and the validation set
Fig. 1

Prediction of type 1 diabetes based on HLA class II genotypes and minor susceptibility genes. (a) ROC curve for the prediction of type 1 diabetes using HLA class II genotypes and HLA plus 40 SNPs based on multivariable logistic regression. The predictive power of the model is shown in the training set (pink line), tenfold cross-validation (green line) and in the validation set (blue line). Solid lines represent the HLA plus 40 SNPs model, dashed lines mark the ROC curves for HLA only. (b) Effect sizes of all variables quantified by log odds ratios. Error bars indicate 95% CIs. *p < 0.05 and **p < 0.005

Fig. 2

Risk score histogram on the validation set. Probabilities from the logistic regression model are shown for patients with type 1 diabetes and controls. White bars, controls; dark grey bars, cases; light grey bars, overlap

To account for possible interaction effects between variables, we constructed extended logistic regression models with second-order interaction terms between all pairs of SNPs, as well as logistic regression models with interaction terms between HLA and non-HLA SNPs. Moreover, we applied a support vector machine classifier with an RBF kernel and a Random Forest classifier, predictive models that inherently consider interactions between variables. We did not observe any improvement in AUC over the standard logistic regression model, which had a reference AUC of 0.84 (test AUC 0.74 for logistic regression with SNP–SNP interaction terms, 0.83 for logistic regression with SNP–HLA interaction terms, 0.75 for the SVM and 0.82 for the Random Forest). This indicated that interaction effects among the genetic factors analysed did not play a sufficiently large role to warrant inclusion in prediction models.

Selection of a reduced set of SNPs with comparable prediction quality

We investigated whether a smaller set of SNPs could achieve discrimination similar to that provided by the full 41 features, using a model-selection and feature-ranking method based on rjMCMC sampling. This stochastic method explores all potential logistic regression models (i.e. all combinations of SNPs). Figure 3 illustrates the selection results, showing the sampled SNP combinations and their model probabilities, and hence which combinations of SNPs should be selected for discrimination. The highest-ranked SNP combinations (models) contained similar sets of only a few SNPs. For example, the top ten models included HLA, a core set of seven SNPs from the PTPN22, INS, IL2RA, ERBB3, ORMDL3, BACH2 and IL27 genes, and between one and five additional SNPs. This indicated that HLA and the core set of seven SNPs were essential for good performance, which could be improved further by adding interchangeable SNPs from a larger pool of additional SNPs.

Fig. 3

Feature ranking using rjMCMC, showing 285 accepted models from the rjMCMC algorithm. The plot visualises how often and in which combinations discriminative SNP sets are selected. (a) Coloured rectangles indicate that a SNP was included in the respective model. The colour codes refer to the log odds of the SNP in the model. The frequency with which a SNP appears in these models can be interpreted as the importance of the SNP for classification. (b) Posterior probabilities of the models. Note that all models displayed here can be regarded as viable in the model selection process

To select an optimal number of SNPs, we derived a feature ranking based on the marginal inclusion probability of each SNP. Ranking the features either by the rjMCMC model selection approach or by log odds ratio (high to low) from the multivariable logistic regression model yielded almost identical orderings. To further demonstrate the benefit of our variable ranking, we also generated 500 randomised variable orderings (Fig. 4). The resulting plot allowed us to choose a customised trade-off between the number of genes in the model and model performance. For example, when the first ten ranked markers were selected (HLA, PTPN22, INS, IL2RA, ERBB3, ORMDL3, BACH2, IL27, GLIS3 and RNLS), the AUC was 0.86 (95% CI 0.84, 0.88) in the T1DGC set, 0.86 (95% CI 0.84, 0.88) in the tenfold cross-validation and 0.82 (95% CI 0.79, 0.84) in the validation set. This was only slightly lower than for the full model containing all SNPs (0.84; 95% CI 0.81, 0.86 in the validation cohort).

Fig. 4

Performance evaluation of ranked SNPs. Model performance for cumulative SNPs included in the model is shown, illustrating the trade-off between the number of genes in a model and model performance. The order of the SNPs corresponds to the rjMCMC-based feature ranking, wherein SNPs are included in a cumulative fashion from left to right, starting with HLA category 6. Cross-validation performance (dashed line) as well as performance in the validation set (dotted line) are shown, together with a performance curve for multiple rounds of feature inclusion at random (no feature ranking, solid line; error bars indicate SDs over 500 randomisations)

Application of model to screening in families

The performance of the reduced model was assessed in longitudinal data from the German BABYDIAB and BABYDIET studies [14, 15]. Genetic data for the ten SNPs in our reduced model were available for 1,772 children, including 99 who developed multiple islet autoantibodies and/or type 1 diabetes during follow-up. As expected for first-degree relatives of patients, the distribution of risk scores derived from our ten-SNP model in the BABYDIAB and BABYDIET children was slightly shifted relative to the distribution in the validation set (p = 0.0003, Fig. 5a). The 1,772 children were post hoc stratified into four risk score centile groups (<10th centile, 10th to 50th centile, 50th to 90th centile, >90th centile). Markedly increased risk of multiple islet autoantibodies or type 1 diabetes was observed in children with scores above the 90th centile (5 year risk 18.2%; 95% CI 12.3, 24.1; n = 177) compared with children with intermediate scores in the 50th to 90th centile (3.5%, 95% CI 2.1, 4.9; p < 10⁻¹⁰ vs >90th centile; n = 708) or 10th to 50th centile (2.5%, 95% CI 1.3, 3.7; p < 10⁻¹⁰ vs >90th centile; n = 710), or scores below the 10th centile (0%, p < 10⁻¹⁰ vs >90th centile; n = 177; Fig. 5b). Children with scores above the 90th centile included 39 (40%) of the 99 children who developed multiple islet autoantibodies or diabetes.
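For reference, this post hoc stratification could be expressed as below (a sketch under the same assumed column names as the Methods sketches, not the original code).

```r
library(survival)

# cut the ten-SNP risk score at the 10th, 50th and 90th centiles
cuts <- quantile(cohort$risk_score, probs = c(0, 0.1, 0.5, 0.9, 1))
cohort$risk_group <- cut(cohort$risk_score, breaks = cuts, include.lowest = TRUE,
                         labels = c("<10th", "10th-50th", "50th-90th", ">90th"))

# 5 year cumulative risk per centile group (1 - the Kaplan-Meier estimate)
summary(survfit(Surv(followup, event) ~ risk_group, data = cohort), times = 5)
```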

Fig. 5

Performance of risk model in a prospectively followed cohort of children. (a) Distribution of the risk score derived from our ten-SNP set (HLA, PTPN22, INS, IL2RA, ERBB3, ORMDL3, BACH2, IL27, GLIS3, RNLS) in children from the BABYDIAB and BABYDIET studies. (b) Cumulative risk for the development of multiple islet autoantibodies or diabetes based on the risk score derived from the top-ten-SNP set. The number of children still followed in each of the categories (black solid line, above the 90th centile; dashed line, 50th to 90th centile; dotted line, 10th to 50th centile; grey solid line, below the 10th centile) is shown; p < 0.001 overall. (c) Cumulative risk for the development of multiple islet autoantibodies or diabetes in children from the BABYDIAB and BABYDIET studies with HLA-DR3/DR4-DQ8 or HLA-DR4-DQ8/DR4-DQ8 genotypes, based on the risk score derived from the ten-SNP set; p = 0.007. The number of children still followed in both categories (black solid line, above the 90th centile; dashed line, below the 90th centile) is shown

We previously showed that HLA DR-DQ genotyping alone can stratify the risk of multiple islet autoantibodies and that children with HLA-DR3/DR4-DQ8 or HLA-DR4-DQ8/DR4-DQ8 genotypes had substantially increased risk [24]. We therefore examined whether the ten-SNP score could discriminate risk among children with HLA-DR3/DR4-DQ8 or HLA-DR4-DQ8/DR4-DQ8 genotypes (Fig. 5c). Of the 153 children with high-risk HLA genotypes, 109 had a ten-SNP risk score above the 90th centile of all 1,772 BABYDIAB and BABYDIET children. The 5 year risk of multiple islet autoantibodies or type 1 diabetes was 22.7% (95% CI 14.6, 30.8; n = 109) in the HLA high-risk children with risk scores above the 90th centile and 7.4% (95% CI 0.1, 15.6; n = 44) in the remaining HLA high-risk children with scores below the 90th centile. Of the 32 HLA high-risk children who developed multiple islet autoantibodies or type 1 diabetes, 29 (91%) had risk scores above the 90th centile.

Simulated application of model to population screening

We subsequently asked how the genetic selection might perform in general population screening, using simulated projections of risk. We calculated hypothetical population-based positive predictive values at different specificities, assuming a disease prevalence of 0.5% by the age of 20 years (Table 3). For high sensitivity, the simulated model proposes a threshold that would identify >50% of future cases and would require selection of 10% of the population; these children would have an estimated 2.6% risk of type 1 diabetes. For high specificity, selection of children with up to 20% risk might be achieved using a threshold that selects 0.5% of the population and identifies 24.1% of cases (e.g. at 99.5% specificity; Table 3). Using the latter example, if 200,000 children were screened, of whom 1,000 (0.5%) are expected to develop diabetes, we would select 1,236 with a risk score >0.97. Of these 1,236 children, 241 are projected to develop type 1 diabetes before the age of 20 years. In comparison, the highest-risk HLA genotype (DR3/DR4-DQ8) alone is simulated to have a specificity of 98.8% (2,672 children selected) and a risk of 10.7% (284 developing diabetes; Table 3).
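The figures in this projection follow directly from the stated sensitivity, specificity and prevalence; the short calculation below reproduces them (a worked example, not the simulation code used in the study).

```r
n           <- 200000   # children screened
prevalence  <- 0.005    # 0.5% develop type 1 diabetes by age 20
sensitivity <- 0.241    # proportion of future cases above the threshold
specificity <- 0.995    # threshold selects 0.5% of unaffected children

cases    <- n * prevalence                   # 1,000 future cases
tp       <- sensitivity * cases              # 241 true positives
fp       <- (1 - specificity) * (n - cases)  # 995 false positives
selected <- tp + fp                          # 1,236 children selected
ppv      <- tp / selected                    # ~0.195, i.e. around 20% risk
c(selected = selected, ppv = ppv)
```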

Table 3 Performance of our ten genetic marker model in general population screening

Discussion

The use of weighted models incorporating genotype information for HLA class II genes and SNPs from 40 minor susceptibility genes provided relatively high discrimination for type 1 diabetes. Although the HLA genes made the major contribution to prediction, the addition of SNP genotypes from minor genes significantly improved the prediction models. No further improvement was observed when interactions between the 41 genetic markers were considered. Feature selection identified HLA plus seven SNPs from the PTPN22, INS, IL2RA, ERBB3, ORMDL3, BACH2 and IL27 genes as the minimal set of genetic markers to include in high-performing weighted risk models, and incorporating these plus two further SNPs achieved prediction accuracy similar to that of the total set of analysed genes.

Our study was based on a large training set that included SNPs covering validated type 1 diabetes susceptibility genes. The robustness of the findings was confirmed in a second, independent set. Novel aspects of the analysis include the use of multivariable logistic regression to examine the contribution of SNPs collectively rather than individually; the resulting value is a weighted score indicating each person's genetic risk of developing the disease. An additional novel aspect was the use of feature selection as a tool to identify a limited set of SNPs for prediction. This extends and refines our previous approach, which used a limited SNP set in a small cohort [6]. Some of the non-HLA SNPs selected in the high-performance models from the previous study (PTPN22 and ERBB3) were also selected by the current model. There were important differences between this and our previous study that are likely to have limited the overlap in the identified SNP sets. First, the previous study did not include genotyping for the majority of the seven non-HLA SNPs that were essential for highly predictive models in the current analysis. Second, the previous study selected individuals with HLA risk genotypes instead of using HLA as a factor in the model. Third, the previous study was performed only on children who had a family history of type 1 diabetes. Fourth, SNP sets were previously selected without allowing different weights for the SNPs. Finally, it is theoretically possible that additional SNPs could improve our model if a larger dataset were used.

A potential limitation of our study is that the analysis was performed on cross-sectional data rather than on a prospective dataset. Application to the BABYDIAB and BABYDIET cohorts provided some appreciation of how the model could perform in a prospectively followed population. If selection into the BABYDIET study had been based on a ten-SNP risk score that identified the upper 10th centile of the children screened, we would have enrolled 130 children, 21 of whom developed diabetes during follow-up. In comparison, the actual selection, which was based on HLA typing alone, identified 169 children, of whom 12 developed diabetes. This example is limited to children who have both a family history and a high genetic risk score. Familial cases may be enriched for unusual cases, such as those associated with rare variants. Thus, it will be important that the model is properly validated in prospective studies within the general population, where absolute risk is substantially lower than in relatives. It is also likely that the models we have identified are not optimised for all ethnic and regional groups [25].

Our analysis has relevance to ongoing and future natural history and prevention studies performed in children who are genetically at risk of type 1 diabetes [26–28]. Selection is currently based on type 1 diabetes family history and/or HLA risk genotypes. We simulated a broader application of a weighted model for the set of ten genetic markers identified in the present study to general population screening. In the simulated example, we could select children with around 20% risk, and include nearly a quarter of future cases, when the threshold was set to select 0.5% of the general population (Table 3). Typing could be achieved with two or three SNPs from the HLA region, as recently shown [29, 30], plus the nine SNPs from the additional genes. True performance will require validation in such a setting, but screening based on these ten genetic markers may provide a more efficient selection of at-risk children than screening with HLA alone.

In conclusion, we were able to improve prediction for type 1 diabetes by multiple logistic regression and feature ranking analysis methods on large susceptibility SNP sets. We suggest that these approaches and weighted SNP genotype models similar to those that we have identified could be used for selection of cohorts of at-risk children in natural history and appropriately safe prevention studies.