Umbilical cord blood DNA methylation in children who later develop type 1 diabetes

Aims/hypothesis Distinct DNA methylation patterns have recently been observed to precede type 1 diabetes in whole blood collected from young children. Our aim was to determine whether perinatal DNA methylation is associated with later progression to type 1 diabetes. Methods Reduced representation bisulphite sequencing (RRBS) analysis was performed on umbilical cord blood samples collected within the Finnish Type 1 Diabetes Prediction and Prevention (DIPP) Study. Children later diagnosed with type 1 diabetes and/or who tested positive for multiple islet autoantibodies (n = 43) were compared with control individuals (n = 79) who remained autoantibody-negative throughout the DIPP follow-up until 15 years of age. Potential confounding factors related to the pregnancy and the mother were included in the analysis. Results No differences in the umbilical cord blood methylation patterns were observed between the cases and controls at a false discovery rate <0.05. Conclusions/interpretation Based on our results, differences between children who progress to type 1 diabetes and those who remain healthy throughout childhood are not yet present in the perinatal DNA methylome. However, we cannot exclude the possibility that such differences would be found in a larger dataset. Graphical abstract Supplementary Information The online version contains peer-reviewed but unedited supplementary material available at 10.1007/s00125-022-05726-1.


Power analysis
The power analysis was performed in simulated bisulfite sequencing data through a tool developed by Lea et al. [1]. The following parameter values were applied: The relatedness matrix (relatedness between individuals based on correlations between their SNP profiles) was utilized as the data structure matrix. To be conservative, we assumed the data structure would explain a much larger proportion of the variation than the variable of interest (case/control) at sites that were simulated as true positives. Based on this simulation, the power to detect true positives was estimated to be 89.6 % (ESM Fig. 1) at FDR 0.05. Compared to this simulation, statistical power in the actual analysis was increased by adjusting for the spatial correlation between CpG sites. However, small methylation differences (e.g. 1 %) at individual CpG sites (that do not belong to a differentially methylated region) are not likely to be discovered in this study.
The power to detect differential methylation in real data with our workflow has been demonstrated in our preprint on sex-associated DNA methylation [2]. However, since the sample number was slightly lower in the present study, we re-evaluated the statistical power by repeating the differential methylation analysis between the sexes in these 122 samples (42 females vs. 80 males, while accounting for all the covariates listed in the main text Table 1) and comparing the results to two independent published studies on sex-associated umbilical cord blood DNA methylation that were based on a different technology (methylation microarrays) and different populations [3,4]. With 122 samples we were able to detect 3935 sex-associated CpG sites (FDR < 0.05) out of which 247 were covered by Illumina 450K methylation microarrays (and hence could have been detected by the earlier studies). We were able to confirm 145 sexassociated CpG sites out of 1675 CpG sites that were associated with sex in one or both earlier studies [3,4] and were covered by our RRBS data. The overlap was highly significant (Fisher's exact test p value < 2.2 × 10 -16 ). These 145 confirmed sex-associated CpG sites included sites with very small effect sizes. Their absolute coverage-corrected mean methylation differences between the sexes ranged from 0.4 % to 20.4 %, median 9.4 %.

Sample collection and HLA risk class determination
Umbilical cord blood was collected immediately after birth in 3 ml K3-EDTA tubes in the delivery room at the Turku University Hospital and transferred to the DIPP study centre where it was stored at -20°C. After receiving informed consent from the parents, each sample was transferred to the immunogenetics laboratory (destroyed in case informed consent was not received), where it was thawed, and a drop of blood was transferred to a sample collection card for HLA genotype determination. The remaining sample was stored at -20°C. DNA was extracted by applying salting out procedure [5]. HLA-DR-DQ genotypes were determined from the DNA on the dried blood spots using assays that were designed to densely probe the genomic regions associated with type 1 diabetes. The genotyping was started from major DQB1 alleles. A hypervariable region on the second exon was further sequenced for individuals with certain DQB1 haplotypes. A detailed description of the genotyping procedure and the risk class definitions have been reported earlier [6,7]. The genotypes corresponding to each risk class within this study are: Study participants were invited to the follow-up based on an increased risk to develop type 1 diabetes, according to the HLA risk class definitions at their time of birth. HLA risk class definitions have been updated during the study years to increase the sensitivity and specificity of the screening [6,7].

Islet autoantibody measurement
During DIPP follow-up, serum samples for islet autoantibody measurements were collected at 3month intervals until the age of 2 years and thereafter every 6 or 12 months until the age of 15 years or until type 1 diabetes diagnosis. Islet autoantibodies were measured with specific radiobinding assays and included IAA (insulin autoantibody), IA-2A (insulinoma-associated protein 2 antibody), GADA (glutamic acid decarboxylase antibody) and ZnT8A (zinc transporter-8 antibody). Classical islet cell antibodies (ICA) were used as the only autoantibody screening method for DIPP children born until the end 2002, and if positive, all other autoantibodies were measured from all earlier and future samples of the child. ICA, IAA, GADA and IA-2A were measured from all follow-up samples from children born since 2003, whereas ZnT8A were measured only if at least one of the other autoantibodies became positive [8]. In addition, all five islet autoantibodies were measured from all samples of the first 1000 children participating in DIPP follow-up [8].

Sample inclusion criteria
Out of 200 umbilical cord blood DNA samples within this analysis, 20 were rejected due to low (< 97 %) bisulfite conversion efficiency, two were excluded due to missing clinical data, and five were rejected due to inadequate amount or quality of DNA. Samples from individuals with any transient islet antibodies (N = 47) or persistent positivity for only one islet antibody (N = 4) were excluded from this study on type 1 diabetes but included in our study on perinatal DNA methylation associated with other variables [2].

Reduced representation bisulfite sequencing
DNA concentrations were measured with Thermo Scientific NanoDrop 2000, and the samples were purified with Genomic DNA Clean & Concentrator -10/ ZR-96 Genomic DNA Clean & Concentrator -5 plate kits (Zymo Research, cat. nos D4010 and D4066) according to the protocols for each kit. Library preparation was started from 200 ng of genomic DNA, and E. coli genomic DNA (USB, cat. no. 14380) was used as a carrier, 50 ng/library. Library preparation protocol for reduced representation bisulfite sequencing (RRBS) was adapted from [9]. As suggested by the protocol [9], a lower concentration of adapters (1:10 dilution) was used than recommended by the manufacturer to reduce the occurrence of adapter dimers. Bisulfite conversion and sample purification were carried out according to the Invitrogen MethylCode Bisulfite Conversion Kit protocol. Aliquots of converted DNA were amplified by 18 cycles of PCR with Taq/Pfu Turbo Cx Polymerase, a proofreading PCR enzyme that does not stall at uracil. PCR-amplified RRBS libraries were extracted using two subsequent rounds of SPRI bead clean-ups to minimize primer dimers in the final libraries. The high quality of the libraries was confirmed with either Advanced Analytical Fragment Analyzer or Bioanalyzer, depending on the size of the prepared library batch. The concentrations of the libraries were quantified with Qubit® Fluorometric Quantitation, Life Technologies, and only high-quality libraries were sequenced.
The samples were normalized and pooled for the automated cluster preparation which was carried out with Illumina cBot station. The libraries were run in 32 lanes, four-seven samples per lane. The samples were sequenced with Illumina HiSeq 2500 instrument using TruSeq v3 sequencing chemistry. Paired-end sequencing with 2 x 100 bp read length was used with 6 bp index run. The technical quality of the HiSeq 2500 run was high and the cluster amount was as expected. A minimum of 85 % of all bases above Q30 was required. The yields were 18 -37 million raw paired-end reads per sample. The base calling was performed using Illumina's bcl2fastq2 software, the output of which is of standard fastq format.
The code for the data analysis is available in GitHub [10]. TrimGalore [11] and Bismark [12] were used for trimming and alignment of paired-end RRBS reads on the GRCh37 (hg19) genome assembly [13]. After removing M-biases, potential single nucleotide polymorphisms [14], and sites with extremely high coverages, a minimum coverage of 10 reads was required for at least one third of the samples in both groups. The coverage distribution and a distribution of missing values per CpG site in the filtered data are presented in ESM Fig. 2

and 3.
To evaluate the coverage of potential genomic regions of interest, we downloaded all highconfidence enhancers that had evidence for interactions with genes from multiple sources ("double elite regulatory elements") for human chromosomes 1 -22 from the GeneHancer database [15] through UCSC table browser, GRCh37 (hg19) genome assembly, accessed on January 25 th 2022 [16]. R package GenomicRanges [17] was utilized to investigate the overlap between these enhancers and our high-coverage CpG sites (2568146 CpG sites that fulfilled the above-mentioned criteria). We also computed the distances between the high-coverage CpG sites and type 1 diabetes risk loci [18], excluding loci in the HLA region, through GenomicRanges.
For the differential methylation analysis between the cases and controls, a generalized mixed effects model (GLMM) implemented in the R package PQLseq [19] was fit separately for read counts at each CpG site on autosomal chromosomes. The covariates listed in the main text Table  1 were modeled as fixed effects and the genetic similarity between individuals as a random effect. Since we could not know, which covariates might be important confounding factors, we included all available reliably recorded information. However, only one covariate was selected from each group of mutually correlated clinical covariates (detailed inclusion criteria are listed in ESM Table 1). We also repeated the analysis with a reduced number of covariates, as described below. A few missing covariate values were median-imputed, and continuous covariates were Ztransformed. The Wald test p values computed within PQLseq were spatially adjusted by a weighted Z-test implemented in package RADMeth [20]. Since the spatially adjusted p values were found to be inflated, false discovery rate (FDR) was estimated empirically through a permutation analysis [2].

Alternative differential methylation analysis with a reduced number of covariates
The differential methylation analysis was repeated such that only necessary covariates were included in the GLMM. These were: Class (case/control): The variable of interest HLA risk class: This was the only clinical variable that correlated with class. We included it as a confounding covariate because the goal was to study DNA methylation associated with type 1 diabetes, independent of HLA risk that has already been thoroughly studied.
Sex: Sex was not significantly unbalanced between the groups (apart from HLA risk, neither was any other variable with available data -not even nominally). However, it was the only clinical variable that was clearly associated with differential methylation in these data, as described earlier [2]. It was therefore included in the reduced model to make sure the results would not be confounded by the slight unbalance between groups (40 % female in the case group vs. 32 % female in the control group). PC 1 and PC 2: the projection of the data on the first two principal components was included to account for technical variation and variation in cell type composition. In these data, PC 1 and PC 2 were observed to correlate with library preparation batches and bisulfite conversion efficiency but not with any clinical variable.
The top two CpG sites with smallest spatially adjusted P-values were the same as those from the above-described analysis with the full model (that included all covariates listed in main text Table 1): chr11:400288 and chr11:400295 (GRCh37 genome assembly) -the only CpG sites that showed even weak evidence of differential methylation as part of a candidate differentially methylated region, based on spatially adjusted p values with an empirically determined threshold at 5.33×10 -13 (ESM Fig. 4, ESM Table 2).
The results were concordant between the simple and the full model. Both models agreed that none of the CpG sites showed evidence of differential methylation based on Benjamini-Hochberg corrected p values (before the spatial adjustment). The correlation between log10 p values was 0.80, and the overlap of top 10 CpG sites ranked by p value was 3. The correlation between spatially adjusted p values was also 0.80, and the overlap of top 10 CpG sites ranked by spatially adjusted p value was 7.

Targeted bisulfite pyrosequencing
Targets for technical validation by bisulfite pyrosequencing were selected based on statistical significance in the RRBS analysis. Also, regions that were differentially methylated according to the DAISY study [21] and showed the same direction of difference in this study, were selected.
For this analysis, 28 case subjects and 30 control subjects were included, based on the following selection criteria: Amount of remaining DNA (determined by Qubit™ dsDNA HS Assay Kit, Invitrogen Ref Q32854), full-term (gestational age ≥ 37 weeks), similar sex distributions in both groups, normal birth weight (2.5 -4.5 kg), no multiple pregnancies, normal Apgar points (8 -10), no perinatal asphyxia, vaginal birth and no maternal smoking. PyroMark assay design 2.0 software (Qiagen) was used to design the assays for the methylation validation for the selected targets. Target specific primers designed with the software include primer pair with site-specific biotinylation for target amplification and pyrosequencing primer. The regions of interest were those listed in ESM Tables 2-4, and region chr17:38024237-chr17:38024290 (hg19 coordinates) which included six CpG sites that were differentially methylated between sexes. The sexassociated target was selected as a positive control to confirm that concordant results could be obtained with two technologies (RRBS and targeted pyrosequencing). The sequencing was done on three to five batches such that a roughly even numbers of male and female case and control subjects were allocated to each batch. A linear regression model was fit for each DNA methylation proportion, after applying a transformation: arcsin(2 × proportion -1). The explanatory variables were the same as those included in the RRBS data analysis. Each model was fit with and without the covariate of interest, and the significance of the association was estimated based on an ANOVA test comparing the two models.  ESM Table 1, modified from [2]: Description of variables with available data. Column 2 indicates whether the variable was included in the differential methylation analysis, which was done with a generalized linear mixedeffects model (GLMM). One clinical covariate from each group of mutually correlating covariates was included in the differential methylation analysis to account for potential confounding effects. Pearson correlations greater than 0.3 (in absolute value) with p value < 0.05 or alternatively Fisher's exact test p value < 0.05 (for pairs of binary covariates) were considered relevant. We prioritized continuous covariates and binary covariates with sufficient sample numbers in both groups. We avoided including covariates that contained missing values or are difficult to measure, such as the duration of delivery phase 1.

Details and reasons for inclusion/exclusion
Age, mother Yes The maternal age correlates only with the number of earlier pregnancies and deliveries, and is prioritized over them, since continuous covariates are easier to model than counts.

Birth length
No Correlates with birth weight Birth weight Yes Birth weight is included to represent the child's size. It is prioritized over birth height, head circumference and pregnancy duration, since birth weight has the best measurement accuracy.

BMI, mother
Yes This is the pre-pregnancy body mass index. Class (case/control)

Yes
The goal was to study differential methylation associated with this covariate. Case subjects were diagnosed with type 1 diabetes or became persistently positive for at least two islet autoantibodies, whereas control individuals remained completely negative for islet autoantibodies throughout the follow-up. Caesarean section

Yes
The mode of delivery was simplified to vaginal/C-section. C-section only correlates with perinatal asphyxia and is prioritized, since it is simple to define and does not include any measurement uncertainty Age at diagnosis, age at seroconversion, first-appearing autoantibodies Since the goal was to study differential methylation associated with later disease progression, independent of the well-studied HLA risk, HLA risk was included as a confounding covariate. The original data included four levels (neutral, slightly elevated, moderate, high) but neutral and slightly elevated were merged into one category. In practice, "HLA risk neutral" and "HLA risk high" were modeled as separate binary covariates (moderate was included in the intercept) Induced labor Yes Correlates only with gestational vaginal bleeding and is prioritized, since induced labor is simple to define, whereas gestational vaginal bleeding can have different degrees of severity Infant weight when discharged

No
Correlates with birth weight and is not easily comparable between different individuals, since they spend variable amounts of time in the hospital Insulin-treated diabetes, mother

Yes
This can be insulin-treated diabetes of any type (for example gestational diabetes). This covariate is prioritized over neonatal intensive care, neonatal hypoglycemia, and earlier C-section, since insulin-treatment throughout the pregnancy is expected to be more relevant for umbilical cord blood than events that take place before or after the pregnancy.

Yes
This is a categorical covariate with seven levels (transformed to six binary covariates + intercept The number of earlier deliveries was simplified to a binary variable (0=none, 1=one or more). It correlates with the maternal age.

Yes
The number of earlier miscarriages was simplified to a binary variable (0=none, 1=one or more). There was no reason to exclude this covariate, since it only correlated with the number of earlier pregnancies. Number of earlier pregnancies

No
The number of earlier pregnancies was simplified to a binary variable (0=none, 1=one or more). It correlates with the maternal age. Perinatal asphyxia

No
Correlates with C-section

Pregnancy duration
No Correlates with birth weight Principal components 1 and 2

Yes
The principal components analysis (PCA) on the coveragefiltered methylation proportion matrix was done after median-imputing remaining missing values for each CpG site. The projection of the data on principal components 1 and 2 was included in the model to represent technical variation.

Sex
Yes Correlates with no other covariate Smoking during pregnancy

Yes
The data included seven mothers smoking throughout the pregnancy and one mother smoking only during the first trimester. This variable was simplified to 0=no smoking, 1=smoking. Covariates related to in-utero conditions were generally prioritized. Transformed month

Yes
This is the month of birth, cosine-transformed (cos(2 m/12), where m is the month as numbers 1 -12) Umbilical arterial blood pH

No
Too many missing values (18 out of 122)

Usage of epidural anesthetic
Yes This is a binary variable indicating whether epidural anesthetic was used during delivery stage 1. The value was corrected from 1 to 0 for two individuals with an elective Caesarean section. This variable was prioritized over duration of delivery stages 1 and 2 since usage of epidural anesthetic is simple to define and does not include missing values or measurement uncertainty. Year of birth (= sample collection year)

Yes
Included to account for possible technical variation related to storage time at -20°C ESM  Fig. 1: A QQ-plot of simulated -log10 p values, compared to a uniform distribution. This is a visualization of statistical power, based on simulated bisulfite sequencing data, generated by the tool by Lea et al. [1]. The simulated data was generated with coverage range 10 -50 with 43 and 79 samples in each group, corresponding to our numbers of cases and controls. The proportion of variation explained by the group (case/control) was assumed to be 0.25, while the proportion of variation explained by other data structure was assumed to be 0.5. The proportion of data simulated as true positives was 0.1. The differential methylation analysis was performed with a linear mixed effects model with FDR threshold 0.05. The power to detect true positives was estimated to be 89.6 %. The amount of detected negatives (falsely detected as positives) was 0.756 %. The average difference between the groups was 18.4 % in this simulation.
ESM Figure 4: Two CpG sites on the intron of Plakophilin 3 (PKP3) showed weak evidence of differential methylation between cases and controls (not as individual CpG sites but as part of a region with spatially adjusted p values in the order of 10 -13 ). However, technical replication by pyrosequencing revealed that the differences were not significant. The RRBS and pyrosequencing results are presented here as boxplots. The raw p values for differential methylation (based on GLMM for RRBS and ordinary linear model for pyrosequencing, separately at each CpG site) are marked below each plot. The midline of each boxplot is drawn at the median, boxes range from the 1 st to the 3 rd quartile, and whiskers extend to the most extreme values.
ESM Figure 5: Altogether 6 sex-associated differentially methylated cytosines on the promoter of Zona Pellucida Binding Protein 2 (ZPBP2) were technically validated by targeted pyrosequencing. The RRBS and pyrosequencing results are presented here as boxplots. The raw p values for differential methylation with respect to sex (based on GLMM for RRBS and ordinary linear model for pyrosequencing, separately at each CpG site) are marked below each plot. The midline of each boxplot is drawn at the median, boxes range from the 1 st to the 3 rd quartile, and whiskers extend to the most extreme values. These results have been described earlier in [2].