Background

DNA methylation of the eukaryotic genome is essential for normal cellular differentiation and embryonic development [1, 2]. Methylation within gene promoter regions is important in the regulation of gene expression [3], and changes in both overall methylation and specific methylation patterns have been shown to vary within an individual according to tissue type and disease status [4]. In healthy people, inter-individual differences in methylation are also observed and it has been postulated that these differences may be influenced by various factors, including sex [516].

Studies that have investigated methylation differences between males and females in repeat sequences (LINE and Alu) and other targeted genomic areas have reported sex-specific methylation differences at various autosomal sites and suggested a tendency toward higher methylation levels in males than in females [511], although one study reported no significant difference [17].

Advances in microarray technology have enabled the assessment of genome-wide methylation by surveying CpG methylation at thousands of sites across the genome. The Illumina Infinium series is a commonly used commercial platform and the HumanMethylation27K was one of the first comprehensive methylation microarrays. It interrogates over 27,500 CpG sites in the promoter regions of more than 14,000 RefSeq genes. Studies using this array have reported that a large number of autosomal CpG sites (up to 5% of autosomal loci, or 1,333 CpGs) appear to be differentially methylated in females and males, and that global CpG methylation is higher in males than in females [10, 1216].

Recent work has demonstrated substantial sequence overlap between autosomal and sex-linked probes on the HumanMethylation27K microarray and that up to 10% of the probes are nonspecific and map to highly homologous genomic sequences [18]. The majority of the sex-associated methylation sites at autosomal CpG loci reported to date are likely to be technical artefacts created by the presence of cross-reactive autosomal probes hybridizing to both autosomal and sex chromosomes [18, 19].

We report new CpG associations with sex that are not due to nonspecific probes on the array. Many of the CpG associations lie in genes thought to be involved in spermatogenesis and male fertility, such as SLC9A2, SPESP1, CRISP2, and NUPL1. Pathway analysis revealed overrepresentation of genes differentially methylated between males and females in several Gene Ontology (GO) biological processes, such as regulation of immune response, RNA splicing and DNA repair. Our findings support previously reported global methylation differences between females and males: increased autosomal methylation in males and increased X chromosome methylation in females.

Results

X chromosome analysis and sex assignment

Following quality control (see Methods section), the final dataset included data on 7,333 samples from 81 studies from the European Bioinformatics Institute (EBI) database. Of these, 5,207 samples had sex recorded from the EBI phenotype files. The characteristics of this final dataset are displayed in Table 1.

Table 1 Characteristics of the final dataset (n = 7,333, 81 studies) included in the analysis

We initially sought to investigate whether sex could be inferred from X chromosome methylation data using principal component analysis (PCA). The first two principal components (PCs) were plotted against each other for all samples of known sex (Figure 1(A)). Colouring by the recorded sex from the EBI phenotype files indicates that sex can be determined by classifying samples based on their first PC, with samples recorded as ‘other/trisomy’ (n = 60) clustering in the middle. The second PC, in contrast, contributes little to the separation of males and females. Figure 1(B) shows that the samples of unknown sex cluster well with those of known sex. Logistic regression of recorded sex on PC1 and recorded sex on PC2 showed that both relationships were significant and, as expected, PC1 was a much better predictor than PC2 (PC1: P < 2e-16, AIC 2447.4; PC2: P = 1e-08, AIC 7037.2).

Figure 1
figure 1

Plot of the first two principal components (PC1 and PC2) based on the 999 X chromosome CpG sites for known males (n = 2,277), females (n = 2,870) and other (n = 60) (plot A) and with those of unknown sex (n = 2,126) superimposed (plot B).

Due to X-inactivation as a result of lyonisation [13], global methylation values across the 999 X chromosome sites contained on the HumanMethylation27K BeadChip were expected to be higher in females than in males and could therefore represent a robust means to distinguish between the sexes. A global X chromosome methylation value was calculated for each individual by summing the individual beta values at each of the 999 X chromosome CpG sites. Global X chromosome methylation across the whole cohort (n = 7,333) was approximately normally distributed (mean ± sd: 395 ± 83; kurtosis = 2.95, skewness = 0.21). For those of known sex (n = 5,207), global X chromosome methylation was significantly higher in females (mean ± sd: 455 ± 52) compared to males (mean ± sd: 328 ± 48; Welch Two Sample t-test P <2.2e-16).

Sex was also inferred by global X chromosome methylation values using the midpoint between the mean global X chromosome methylation values for males and females (391.5). Of the 5,147 recorded sexes (2,277 males and 2,870 females), using global X chromosome methylation to identify males gave a sensitivity (percentage of true males correctly identified) of 91.7% (2,088/2,277) and a specificity (percentage of true females correctly identified) of 89.3% (2,563/2,870) with an overall percentage of 90.7% samples where sex was correctly identified. By comparison, using PC1 to identify males gave a sensitivity (percentage of true males correctly identified) of 93.4% (2,127/2,277) and a specificity (percentage of true females correctly identified) of 93.4% (2,682/2,870) with an overall percentage of 93.4% samples in which sex was correctly identified. Given that PC1 was a more sensitive and specific method than using global X chromosome methylation, we used PC1 to re-classify all sexes in the sample (n = 7,333), resulting in a population of 3,647 females and 3,686 males.

Inspecting all samples (n = 7,333) after sex re-classification revealed some outliers, which were removed (Additional file 1: Figure S2 and S3), leaving a final population of n = 6,795, with n = 5,016 samples of recorded sex. Of these 5,016, 94.6% were consistent with sex as classified using PC1 (Additional file 2: Table S1).

For this final sample of n = 6,795, density plots of global X chromosome methylation by sex revealed distinct peaks using sex as assigned by PC1 (Additional file 1: Figure S3D). Global X chromosome methylation was significantly higher in females (mean ± sd: 463.9 ± 45.6) compared to males (mean ± sd: 314.9 ± 29.0; Welch Two Sample t-test P <2.2e-16) using sex as assigned by PC1.A receiver-operating characteristic (ROC) curve comparing the predictive ability of three metrics generated from the X chromosome methylation data (PC1, PC2 and global X chromosome methylation) showed that PC1 was the best predictor of sex (Figure 2). The area under the curve (AUC) was 0.948 for PC1 versus 0.936 for global X chromosome methylation, and only 0.553 for PC2.

Figure 2
figure 2

Receiver operating characteristic (ROC) curve comparing the predictive ability of three of the metrics generated from the X chromosome methylation data; PC1, PC2 and global X chromosome methylation.

Differences in autosomal methylation between sexes

Following quality control, 27,231 CpG sites on the HumanMethylation27K chip remained for analysis in 6,795 individuals who were successfully classified by sex (Figure 3(A) and (B)). Of these, 26,225 CpG sites were located on the autosomes (Figure 3(B)). A density plot of individual methylation beta values for each of the 26,225 autosomal CpG sites for all 6,795 individuals (Figure 4(A)) showed that across all studies and for both sexes, the majority (68%) of CpG sites had methylation values <0.3, whilst 17.4% of CpG sites had beta values >0.7, the range at which probes would be considered to be fully methylated [20, 21]. These percentages were not substantially different by sex (Additional file 2: Table S2).

Figure 3
figure 3

Flow chart of sample quality control (A) and CpG quality control (B) for the data downloaded from the European Bioinformatics Institute (EBI) database. PBL: Peripheral blood leukocyte.

Figure 4
figure 4

Density plots of methylation beta values at individual CpG sites (A) and global autosomal methylation (B) across all 26,225 autosomal CpGs in all 6,795 samples, coloured by sex.

Global methylation across the 22 autosomes was calculated for each sample by summing the individual CpG beta values across the 26,225 autosomal CpG sites. Global autosomal methylation was approximately normally distributed (mean ± sd: 7055 ± 652; kurtosis = 6.6, skewness = -0.31) but there were a number of female outliers (Figure 4(B) and Additional file 1: Figure S4), which skewed the mean female global autosomal methylation value (Additional file 2: Table S3). Median global autosomal methylation in males was slightly, yet statistically significantly higher than in women (median [IQR]: males 7,190 [6,770-7,426], females 7,135 [6,754-7,368]), Wilcoxon rank sum P =5.2e-05).

Global methylation across autosomal CpG sites varied greatly between studies (Additional file 1: Figure S5). Given that this variability could be due to underlying batch or cohort effects, a random effects meta-analysis was conducted for studies that included ≥20 individuals, and in which both sexes were represented in at least a 1:4 ratio (Additional file 3). A total of 39 studies (n = 4,172) met these criteria, of which 10 included cancer samples. The difference in global autosomal methylation according to sex in each of these studies is shown in Additional file 1: Figure S6. Meta-analysis revealed that global autosomal methylation was indeed very heterogeneous across studies, with the proportion of global autosomal variation in study estimates that is due to heterogeneity (I2) =87.9% [95% CI: 84.5%; 90.6%, P < 0.0001]. However, a funnel plot of the 39 studies did not indicate any major outliers (Additional file 1: Figure S7) and the 10 cancer studies clustered well with the 29 non-cancer studies. Despite the high heterogeneity, application of a random effects model suggested that global autosomal methylation was nominally higher in males than in females (mean beta difference = 48.9, P = 0.049).

PCA of individual beta values at autosomal CpG sites by sex was also highly confounded by batch effect, even after adjustment for study (Figure 5). Hence, to identify individual CpG sites differentially methylated between males and females, we conducted a random effects meta-analysis using the 39 studies (n = 4,172) in which both sexes were represented. Estimated overall effects (mean difference in beta values between females and males, weighted by study) and corresponding P values were calculated. Overall, 235 (0.9%) of the 26,225 CpG sites were associated with sex after Bonferroni correction (P <1.9e-06).

Figure 5
figure 5

Principal Components plots using the 1 st two principal components calculated using individual beta values at all autosomal CpG sites adjusted for study, coloured by study (A) and sex (B).

Autosomal gene-specific methylation

All previously identified non-specific or polymorphic CpG probes (as described by Chen and colleagues [18]) on the HumanMethylation27K array were cross-referenced against the 26,225 autosomal CpGs in the meta-analysis. In total, 2,783 of the 26,225 autosomal CpGs were non-specific and 838 were polymorphic. Of the 235 CpG sites associated with sex after Bonferroni correction in this study, 48 probes are non-specific and of the remaining 187 probes, 3 were polymorphic. These probes are indicated in the full results table (Additional file 4).

Meta-analysis of the cohort excluding cancer samples (31 studies, n = 2,900) was performed to ensure results were not due to confounding by cancer samples. There was good correlation between the P values from both meta-analyses, with linear regression r2 = 0.33, P < 2.2e-16 (Additional file 1: Figure S8). In addition, the majority (150/235) of the CpG sites that passed Bonferroni correction (P < 1.91e-06) in the meta-analysis of all samples passed the same threshold in the meta-analysis excluding cancer samples. The remaining 85 had P values between 1.94e-06 and 1.28e-02 in the meta-analysis excluding cancer samples (Additional file 4).

Mean differences in beta values between females and males were small. The mean beta difference between sexes for the 184 statistically significant probes was 0.037 (3.7%, Additional file 2: Table S4), with the vast majority of associated CpG sites (n = 178, 97%) more methylated in females than in males.

Pathway analysis revealed significant enrichment of genes with sex-associated changes in CpG methylation in 53 GO Biological Pathways at P < 0.05 (FDR adjusted). All biological processes enriched at P < 0.01 (FDR adjusted) are shown in Table 2 and largely comprise cellular ‘housekeeping’ functions. Gene overlap between pathways was relatively low, with 367 (73%) of the 500 genes in the top four processes (RNA splicing, DNA repair, protein modification by small protein conjugation and viral reproduction) unique to only one of these pathways (Additional file 1: Figure S9). These top four processes were relatively distinct from the individually most strongly sex-associated CpG sites, with only 10 genes from these processes represented in the CpG sites which passed Bonferroni correction. The median –log10 P values of the CpG sites of the genes in the top four biological processes compared to P values across all CpG sites are displayed in Figure 6.

Table 2 All results P < 0.01 (FDR adjusted) from the NetGestalt Pathway Analysis
Figure 6
figure 6

Boxplot of meta analysis –log10 P for all 26, 225 autosomal CpGs, compared to those located in the genes included in the top four enriched GO Biological Processes; RNA splicing (n = 112 genes), DNA repair (n = 116 genes), protein modification by small protein conjugation (n = 122 genes) and viral reproduction (n = 140 genes).

Discussion

In this study we conducted a meta-analysis of publicly available, genome-wide methylation datasets to examine the possible relationship between methylation profiles and sex. Following rigorous quality control, we analysed the data for differential methylation patterns in terms of global X chromosome methylation, global autosomal methylation, and differences between the methylation status at individual autosomal CpG probes according to sex.

X chromosome analysis

The X chromosome analysis demonstrated that, as expected, global X chromosome methylation is considerably higher in females compared to males. We also demonstrated that it is possible to accurately infer sex, based on PCA of X chromosome methylation data, and that this method is superior to using global X chromosome methylation. In light of the increasing evidence for the importance of correction for sex in analysis of disease-specific methylation patterns, this provides a valuable technique for studies where sex information is not easily accessible.

Autosomal methylation analysis

This analysis showed a small but significant increase in global autosomal methylation associated with male sex, concurring with previous studies [7]. Many mechanisms may account for this global difference, but have so far remained elusive. A recent study indicated that this global difference in autosomal methylation does not appear to be driven by sex hormones [22].

The findings of the analysis of methylation at individual CpG sites in this study have a number of features in common with the findings of Chen et al. [18]. After excluding probes which were found to be non-specific, Chen et al. [18] also found only small absolute differences between male and female samples in the CpG probes that they found to be significantly differentially methylated by sex (2-12%). Similar to this study, the majority (5/7) of the CpG sites that they found to be associated with sex were more methylated in females. Chen and colleagues reported associations between three (cg08124399 in DDX43, cg08532057 in NUPL1 and cg18485485 in DECR1) of our top 20 CpG sites and similar differences in beta values between sexes [18].

As predicted, the majority of the genes that Liu et al. [13] found to be differentially methylated according to sex were on the X chromosome. These investigators reported 11 genes with very small CpG differences in autosomes: LRRC2 TDGF1, RAB9P1, C6ORF68, TLE1, GLUD1, ALX4, DPPA3, NUPL1, FLJ20582 and FLJ43276. All of these genes were found to be significant in our analysis (Additional file 3), with most being among our top 20 loci. Apart from the NUPL1 CpG probe, however, all of these CpG sites were found to be non-specific by Chen et al. [18].

Autosomal sex-specific DNA methylation has been demonstrated in several candidate gene studies [5, 6, 14, 23, 24]. Although some of these genes (CDKN2A, MTHFR and MGMT) are also interrogated to some extent by the HumanMethylation27K array, we observed significant sex-specific DNA methylation only with MGMT (Additional file 3), with all MGMT probes being more methylated in females. These findings are in contrast to previous reports in which higher methylation was shown in males [5, 24].

Genes containing CpG islands with methylation differences of >5% between the sexes are listed in Table 3. Many of these top CpG associations are in genes that may play a role in sex-specific functions. The top-hit from our meta-analysis, SLC9A2, encodes Solute Carrier Family 9. Other Solute Carrier Family proteins have been shown to be associated with male infertility [25], suggesting a possible sex-specific role for this class of proteins. DDX43 has been shown to be differentially expressed in normal testis compared to testis of patients diagnosed with Sertoli cell-only (SCO) syndrome, and may play a role in spermatogenesis [25]. NUPL1 has been shown to be highly expressed in the testes of fertile men compared to that of infertile men, suggesting its function is important for male fertility [25]. These genes may be associated with male infertility secondary to epigenetic deregulation. The results of the pathway analysis, on the other hand, suggest that the influence of sex on methylation is broad. We report enriched biological processes across a wide range of cellular functions, some of which, including DNA transcription and RNA splicing, were also reported by Liu et al. [13].

Table 3 CpG sites with a differences in methylation of >5% between males and females in meta-analysis of 4,172 samples

Since the completion of this study, an increasing amount of data from the HumanMethylation450K microarray are becoming publicly available, along with tools for their analyses [26]. The 450K array offers higher-resolution methylation data than the 27K, and it will be interesting see what this new set of data will reveal about sex-specific methylation patterns.

Conclusions

With the inclusion and careful analysis of all publicly available datasets, this study represents the most comprehensive analysis of sex-specific methylation patterns to date. This is likely the reason that, unlike previous studies, which reported few detectable DNA methylation differences in autosomal genes between sexes [13, 18], we identified a modest number of CpG sites associated with sex after Bonferroni correction. Similar to other studies using microarray platforms, however, the mean beta differences associated with sex are very small (approximately 5%) [13, 14, 18]. This is far below the smallest difference (approximately 17%) that microarray platforms reliably estimate [27]. The reason for this is not clear; it may be that sex-associated methylation differences are highly dynamic, leading to heterogeneous samples. Nonetheless, the results reported here reliably demonstrate the existence of sex-specific methylation profiles, which is important not only in a biological context, but in recognising and dealing with potential confounding when undertaking methylation-disease association studies.

Methods

Data collation and quality control

We extracted publicly available methylation datasets from the European Bioinformatics Institute (EBI) database (http://www.ebi.ac.uk/arrayexpress). The vast majority of studies in this repository were conducted using the Illumina Infinium HumanMethylation27K microarray to assess genome-wide DNA methylation, and we therefore chose to analyse studies using this platform. This microarray uses 50-mer oligonucleotide probes to target 27,578 CpG sites covering ~14,000 autosomal and sex chromosome genes. These CpGs map to the promoter regions of genes with an average coverage of two CpGs per gene and more extensive coverage (3–20 CpGs) for cancer-related and imprinted genes. Infinium technology has previously been described for SNP genotyping [28]. In order to detect methylation differences, DNA is treated with sodium bisulfite, which converts unmethylated cytosines to uracil, whereas methylated cytosine is protected and remains unchanged. Two probes are designed for each CpG site—one is specific for the methylated allele (cytosine) and the other for the unmethylated allele (thymidine). The DNA methylation level for a CpG site is determined by dividing the signal intensity for the methylated CpG by the sum of both the methylated and unmethylated CpGs, previously shown to be a reliable estimate of the level of methylation at a locus [20]. This is referred to as the ‘beta value’ and approximates to the percent methylation divided by 100. Previous studies have shown that probes with beta values <0.3 represent unmethylated areas of the genome [20] and we therefore defined unmethylated probes on this basis. In cell lines, probes with beta >0.7 represent genomic loci that are fully methylated; however, as tissue samples frequently comprise a mixture of cell types and therefore a mixture of methylated and unmethylated probes, we set a beta value threshold of >0.3 to define methylated probes [20, 21]. In this study, in addition to analysing beta values at individual CpG sites, we also investigated global methylation values. A global methylation value was calculated for each individual in the study as the sum of the beta values at each CpG site, either across the X chromosome (global X chromosome methylation), or across all autosomes (global autosomal methylation).

Experiments from the EMBL-EBI database (http://www.ebi.ac.uk/arrayexpress) using the Illumina HumanMethylation27K BeadChip (HumanMethylation27_270596_v.1.2) up to March 2013 were incorporated into the study. As the majority of experiments made processed but not raw methylation data available in the EMBL-EBI database, we included only processed (normalised) datasets. Initially, 92 studies were retrieved from the EBI database. Samples not of human origin, and all samples derived from cell lines were excluded from further analysis. Duplicate entries were removed. Datasets that appeared to have been incorrectly processed (for example, those containing beta values <0 or >1) were excluded. Datasets which reported non- standard CpG site identifiers were also excluded.

Following dataset quality control, phenotype and methylation files were merged. There were 7,614 samples from 84 studies for whom matching methylation and phenotype data were available (Figure 3A). A list of the EBI Accession numbers for these studies, and the number of samples included from each study, is provided in Additional file 5.

Within each study, probes with reported detection P values (generated by Illumina GenomeStudio Software as an objective measure of overall probe performance) >0.05 were removed from the analysis. In the merged dataset of 7,614 samples, all CpG sites and samples with >10% missing methylation data were excluded. Consequently, the final dataset comprised 27,231 CpG sites from 7,381 samples (Figure 3A and B).

Following the exclusion of CpG sites and samples with >10% missing methylation data, 0.13% of the remaining 27,231 CpG sites from 7,381 samples had missing beta values. The median number of missing beta values per CpG site was 3 (IQR 1-5, range 0-191). The median number of missing beta values per sample was 8 (IQR 3-15, range 0-298). Missingness was at random (the probability of missingness was not significantly related to any observed variables – sex, study, CpG site nor individual).

Given that missingness rates were low, and randomly distributed, we did not anticipate missingness to be a problem in this analysis. Missing beta values were removed from the analysis of individual CpG probes.

Sex assignation using X chromosome data

The HumanMethylation27K Chip contains 1,085 CpG probes on the X chromosome. Of these, 86 were removed during quality control (Figure 3(B)), leaving 999 X chromosome CpG sites for analysis. A principal components analysis (PCA) was performed in R, using the prcomp function in the R package ‘stats’ [29]. One study (accession ID: E-GEOD-23311, ‘DNA Methylation in Human Chorionic Villus and Maternal Blood Cells’) was excluded due to being a major outlier on the PCA (Additional file 1: Figure S1).

Meta-analysis of individual CpG probes

A meta-analysis was performed to investigate associations between individual CpG probes and sex in a subset of studies that included at least 20 individuals, and both sexes in a ratio of at least 1:4 (Additional file 4). The meta-analysis was carried out using the R package ‘meta’ [29]. Inverse variance weighting was used for pooling. The DerSimonian-Laird estimate for the between-study variance was used in a random effects model [30]. To ensure that the meta-analysis results were not confounded by cancer samples, a second meta-analysis was performed using the same protocol. All 1,240 cancer samples from the original 4,172 were excluded from this analysis, leaving 31 studies (n = 2,900) that had > 20 individuals and were composed of both sexes.

Pathway analysis of the associations between individual CpG probes and sex

We conducted a pathway analysis using –log10 P values of all CpG probes from the meta-analysis, annotated by gene name. The NetGestalt web application was used to integrate these continuous data over GO biological processes using random walk distance-based hierarchical clustering for module identification [31]. Enriched modules were identified using the Kolmogorov-Smirnov test [32, 33]. The Benjamini–Hochberg method was used for controlling the false discovery rate (FDR) [34]. All annotated genes on the Infinium HumanMethylation27 arrays were used as the background list against which enriched-GO terms in target lists of genes were compared.