Background

It has been proposed that epigenetic variation may contribute to the risk of complex genetic diseases like cancer and that differential exposure to environmental risk factors may underlie much of this variation [1, 2]. Consistent with this view, a recent study has shown that regions that are differentially methylated between normal and cancer tissue appear to be highly variable in cancer itself, and that identification of cancer-relevant markers may therefore benefit from statistics that measure differential variability [3]. Based on these insights, we here aimed to demonstrate that analysis of epigenetic variability in prospectively collected normal cells can predict the risk of future morphological transformation.

In order to demonstrate this in humans, two requirements are mandatory: the cells that are used for epigenetic analyses need to be (i) the cells of origin for the studied cancer, and (ii) they need to be collected years in advance of the onset of cytological and morphological signs of cancer. Currently, the only human organ that meets these two requirements is the uterine cervix. Thus, we used Illumina Infinium technology [4] to measure DNA methylation (DNAm) at 27,578 CpG sites in cytologically entire normal cells (liquid-based cytology (LBC) samples) from the uterine cervix of 152 women (aged 19 to 55 years) in a nested prospective case-control study within ARTISTIC (A Randomised Trial of HPV Testing in Primary Cervical Screening [5, 6]). Prospective cases were women who developed a cervical intraepithelial neoplasia of grade 2 or higher (CIN2+) within 3 years of sample draw, while controls were women who remained disease-free. To further support our data we used completely independent LBC samples with abnormal cytology and associated controls, as well as cervical cancer tissue and normal cervix specimens.

Methods

Study population

The ARTISTIC trial

The LBC samples we analyzed were collected from women as part of the ARTISTIC trial [5, 6]. All women underwent two screening rounds with an interval of 3 years. Within the ARTISTIC trial, women, aged 19 to 64 years who were undergoing routine screening as part of the English National Health Service Cervical Screening Programme in Greater Manchester were randomly assigned in a ratio of 3:1 to either combined LBC and human papilloma virus (HPV) testing where the results were revealed and acted on, or to combined LBC and HPV testing where the HPV result was concealed from the patient and investigator. There were a total of 24,510 eligible women at entry. In the first round of screening 453 women had CIN2+. In the second round of screening 75 women (who were screen-negative in the first round and who had a sample stored from the first round) had developed CIN2+ (44 were HPV-positive and 31 were HPV-negative in the first round). Seventy-seven women who had not developed any cytological changes were matched (age and HPV status in round 1) to the cases. The cytologically normal samples from round 1 from these 152 women were used for DNAm analysis. Further details, cytology and HPV scoring are described in Additional file 1. This trial is registered with the International Standard Randomised Controlled Trial Number ISRCTN25417821.

DNA methylation nested case control study

A total of 152 samples in a prospective nested case control study within ARTISTIC were selected for DNAm analysis. Cases were 75 women who had normal cytology in screening round 1 but demonstrated CIN2+ after 3 years in round 2. Controls were 77 women who had normal cytology at entry and in the second screening round. Cases and controls were matched for age (Wilcox rank sum test P = 0.95) and HPV status: 92 were HPV-positive (44 cases and 48 controls) and 60 were HPV-negative (31 cases and 29 controls) at entry (Fisher test P = 0.74). Informed consent was obtained from the main study population and this study has been approved by the ethical committee (National Research Ethics Service Reference Number 10/H1107/15).

Other DNA methylation data sets

In addition to the nested case control prospective study within ARTISTIC, we used two additional DNAm data sets that were also generated using the same Illumina Infinium 27 k platform. Set 1 comprised a total of 30 LBC samples (19 HPV-negatives and 11 HPV-positives) with normal cytology and 18 LBC samples (all HPV-positive) with CIN2+, as described in [7]. Set 2 comprised a total of 63 cervical tissue samples: 48 cervical cancers, 15 normals. The normal cervical tissue samples were from women (mean age 55.4 years) who underwent a hysterectomy for fibroids of the uterine corpus, that is, these women did not have fibroids in the uterine cervix. The 48 cervical cancer specimens were from women (matched for age with mean age 56.8 years) who were treated at the Innsbruck Medical University. Of these 48 cancers, 26 were of stage 1, 10 of stage 2, 4 of stage 3 and 7 of stage 4 (1 sample had missing stage information). In terms of grade, 7 were of grade 1, 28 of grade 2 and 11 of grade 3 (2 samples had no grading). All specimens were obtained with informed consent and approval from the ethics committee UN4044-290/4.9.

DNA extraction and methylation assay

The DNA extraction protocol is described in Additional file 1. Genome-wide methylation analysis using the Illumina Infinium Methylation27K beadchip (Illumina Inc., USA, WG-311-1201) was performed. This chip interrogates the methylation status of over 27,000 CpG sites throughout the human genome, covering the promoters of over 14,000 genes [4]. Further details are in Additional file 1.

Data availability

All data in this manuscript have been deposited in the Gene Expression Omnibus repository [8] under accession number [GSE30760].

mRNA expression data of normal cervical and cervical cancer tissue

We used the publicly available normalized expression data (all Affymetrix) from the Gene Expression Omnibus with accession numbers [GSE9750] [9], [GSE7803] [10], and [GSE6791] [11]. For each downloaded data set we only selected the normal squamous cervix epithelial and squamous cervical cell carcinoma samples: for [GSE9750] there were 24 normal and 33 cancers, for [GSE7803] there were 10 normals and 21 cancers, and for [GSE6791] there were 8 normals and 20 cancers. Probes mapping to the same Entrez ID were averaged, resulting in 13,213 genes [GSE9750], 13,262 genes [GSE7803] and 20,827 genes [GSE6791].

mRNA expression data analysis

From the above three expression data sets we built an integrated (merged) expression set over 42 normal cervical epithelial samples and 74 cervical cancers using a procedure that we have validated many times previously [1214]. Briefly, there were 13,213 genes in common between all three expression arrays. For each of these genes and for each expression set we then renormalized the gene expression profile to mean zero and unit variance, yielding 'z-profiles'. For each gene, its z-profile in each of the three studies was then merged. This resulted in a merged expression set over 13,213 genes and 116 samples (42 normal and 74 cancers). For each gene, we then computed a t-statistic P-value against normal/cancer status. Of the 140 risk genes, 86 were found in the merged expression set. Of these 86, 46 exhibited differential expression P-values < 0.05. A binomial test was used to test the significance of the skew towards differential under- or overexpression. To adjust for any global (that is, over all 13,213 genes) skew towards under- or overexpression, we also estimated the P-value using a Monte Carlo procedure (1,000 Monte Carlo runs) in which 86 genes were selected at random and a binomial test P-value was recomputed. The fraction of the 1,000 runs with a binomial test P-value more extreme than the observed gives an independent P-value estimate.

Statistical methods

Full details of statistical methods are in Additional file 1. Brief descriptions of the different parts of the statistical analysis are given below.

Quality control and inter-array normalization

The raw DNAm data were subject to a similar quality control procedure as used in our previous publication [7].

Supervised analyses

To identify CpGs associated with age (aCpGs) we used surrogate variable analysis [15]. False discovery rate (FDR) estimation was implemented as in the q-value package [16]. To identify age-independent variable CpGs (vCpGs), we adjusted the data for age, and subsequently estimated the variances for each CpG. Because of the heteroscedasticity of β-values [17], we also estimated the variance using R-values (defined as R = M/U) [18]. To identify differentially variable CpGs (DVCs) between prospective CIN2+ cases and normals, we used Bartlett's test [19]. In doing so, variances were estimated after the methylation profiles were linearly adjusted for age within each phenotype.

EVORA: Epigenetic Variable Outliers for Risk Prediction Analysis

Full details and the model assumptions on which EVORA is based are described in Additional file 1. There are three important statistical aspects to EVORA: (i) identification of candidate risk CpGs; (ii) identification of samples that constitute methylation outliers; and (iii) a method for assigning risk to each sample, which is robust and independent of the scale used. For (i) we use Bartlett's test [19], since our hypothesis is that DVCs defined by outlier profiles are more likely to define risk CpGs [3]. To define outliers in a scale-independent fashion (ii) we use the COPA (Cancer Outlier Profile Analysis) transformation [20]. Lastly, to assign a risk score to a sample, we use an adaptive index algorithm framework [21]. EVORA is freely available as an R-package (evora) [22].

Results

DNA methylation variability is increased in cytologically normal cells predisposed to neoplasia

A stringent quality control and inter-array normalization procedure resulted in a normalized data matrix of methylation values (β-values, 0 < β < 1) over 24,039 CpGs and 152 samples (Methods). Prospective cases (n = 75) and controls (n = 77) were matched for age and HPV status (Methods). Following the suggestion that epigenetic variability may mark genes that contribute to the risk of cancer [1, 3, 23], we hypothesized that differential variability in normal tissue might be associated with an increased risk of neoplasia. We thus derived a list of CpGs that showed significantly different (age-adjusted) variability between future (CIN2+) cases and controls (DVCs) (Methods). We observed many DVCs (Figure 1a) and among the top 500 (FDR < 0.0001; Additional file 2) the majority (73%) were more variable in future cases (Figure 1c). The set of DVCs was largely unchanged if variability was not adjusted for age or if also adjusted for HPV status (Additional file 3). In contrast, testing for differential methylation using t-statistics (differentially methylated CpGs (DMCs)) did not yield genome-wide significance levels (FDR ~0.6 for the top 50 CpGs; Figure 1b). DVCs that showed significantly higher variance in future cases (hypervariable DVCs) generally exhibited small yet consistent increases in mean methylation levels (Figure 1d; Additional file 2). Inspection of typical methylation profiles of such DVCs revealed that the increased or decreased variability was due to approximately 20 to 30% changes in DNAm present in only a relatively small number of 'outlier' samples (Figure 1d; Additional file 4). Developmental genes, and in particular polycomb group targets (PCGTs) [24] (see Additional file 1 for precise definition), were enriched only among the DVCs showing increased variance in the normal cells of prospective cancer cases (odds ratio = 4.9 (3.9 to 6.3), P < 1e-31) and were six times more likely to exhibit higher variability in prospective cases than lower (Additional file 5). PCGTs were also the most highly enriched gene category out of a total of 6,173 gene sets in a Gene Set Enrichment Analysis [25] (Additional file 6). Random permutation of sample labels also showed that this enrichment could not have arisen by chance (Additional file 5). Thus, all these results indicate that increased DNAm variability affects PCGTs and is an intrinsic property of normal cells predisposed to neoplasia.

Figure 1
figure 1

Differentially variable and differentially methylated CpGs. (a) Histograms of P-values derived from Bartlett's test comparing differences in variance between normal samples that become neoplastic (CIN2+) and samples that remain normal (differentially variable CpGs (DVCs)). (b) Histograms of P-values derived from t-tests comparing differences in mean CpG methylation levels between the same two phenotypes (differentially methylated CpGs (DMCs)). (c) Scatterplot of Bartlett statistics (logarithm of the ratio of the variance in prospective CIN2+ to that in normal) shown on the y-axis against the corresponding t-statistics (x-axis) for the top 500 DVCs. The numbers of hypervariable (hyperV) and hypovariable (hypoV) DVCs are given. (d) Typical methylation profile of a hypervariable DVC (blue = prospective CIN2+, green = normal). The thin dashed lines indicate the mean levels of methylation in each phenotype. The P-values shown are from a Bartlett's test (differential variability) and t-test (differential methylation).

Age-associated variation in DNA methylation also correlates with the risk of neoplasia

An unsupervised singular value decomposition and a supervised linear regression analysis (adjusted for CIN2+ status) both confirmed a strong age-associated signature (644 CpGs at FDR < 0.05) with the majority (392, 61%) hypermethylated with age (Additional files 7 and 8). Interestingly, we observed that the 644 CpGs undergoing age-associated changes in DNAm (aCpGs) showed specific directional changes associated with the future development of CIN2+ even though these changes were not individually significant (Additional file 9). Specifically, those CpGs undergoing significant hypermethylation with age were generally also hypermethylated in future CIN2+ cases (Additional file 9). To test further if age-hypermethylated aCpGs are indeed associated with CIN2+, we compared their methylation levels in an independent data set (also generated with the Illumina 27 k platform) consisting of 30 normal and 18 age-matched CIN2+ samples (Set 1, Methods) [7]. The mean methylation level of these aCpGs was also significantly higher in the CIN2+ samples of this set (Additional file 10).

Significant overlap of hypervariable DVCs and age-hypermethylated aCpGs

Next, we explored the relationship between DVCs and aCpGs. Many CpGs showing age-associated hypermethylation also showed significant increases in (age-adjusted) variability within the normal tissues of future CIN2+ cases, while age hypomethylated aCpGs did not (Figures 2a-c). Thus, CpGs that are more variable in prospective CIN2+ cases independently of age overlapped significantly with CpGs that undergo age-associated hypermethylation in normal tissue independently of prospective disease status. This could mean that detecting DNAm changes across a group of individuals of the same age but who may have had variable lifetime exposures to environmental risk factors (and therefore be at variable disease risk) is similar to detecting age-associated changes in a population of differently aged individuals (since lifetime exposures accumulate with age). Because PCGTs were enriched in both hypermethylated (hyperM) aCpGs and hypervariable (hyperV) DVCs (Figure 2c), it was natural to ask if aCpGs mapping to PCGTs and that had been identified from other tissues (for example, whole blood) [7] would also exhibit a preferential skew towards hypervariability. Remarkably, out of the 69 PCGT CpGs identified as hypermethylated with age in whole blood [7], the overwhelming majority were more variable in the epithelial cells of future CIN2+ cases (Figure 2d). In contrast, the 20 PCGT CpGs undergoing age-associated hypomethylation in blood showed no skew towards either increased or decreased variability (Figure 2d). Thus, we can conclude that genes prone to epigenetic variation are also prone to undergo age-associated hypermethylation and that PCGTs define a significant subset of these genes.

Figure 2
figure 2

Relation between differentially variable and age-associated CpGs. (a) Bartlett test P-values (on -log10 scale) of CpGs indicating significance of differential variability (between prospective CIN2+ and controls) (y-axis) versus their average β-value across all samples (x-axis). CpGs undergoing age-associated (aCpG) hypermethylation (hyperM) or hypomethylation (hypoM) are colored as indicated. (b) The ratio (on log scale) of variability in prospective CIN2+ to variability in controls (y-axis) versus significance level (x-axis). Skyblue (orange) denotes CpGs significantly hypermethylated (hypomethylated) with age (aCpGs) in normal cells from uterine cervix. The green dashed line represents the FDR cutoff value of 0.05 for calling DVCs. (c) Venn diagram illustrating overlaps of age-hypermethylated CpGs with DVCs that are hypervariable (hyperV) in prospective CIN2+, and with PCGT CpGs. A total of 41 CpGs overlapped between all three categories and 20,917 CpGs were in none of the three categories. The P-value (estimated from a multiple binomial test) indicates the random chance of observing 41 or more overlapping CpGs. (d) As (b) but now highlighting the 68 and 20 CpGs that map to PCGTs and undergo age-associated hyper- (blue) and hypomethylation (red) in whole blood samples [7]. Among these CpGs, we give the number that are significantly differentially variable (FDR < 0.05, green dashed line) and their distribution in terms of increased or decreased variance in future CIN2+ cases. P-value is from a binomial test.

Differentially hypervariable CpGs predict risk of intraepithelial cervical neoplasia

Based on these results, we proposed the following model in which epigenetic variance may be used to predict the risk of neoplastic transformation. Since the typical DVC methylation profile (Figure 1d) is one in which a small number of samples exhibit increased outlier methylation (≥ 20% methylation change), we hypothesized that cancer-risk in a given sample may be associated with the number of such risk CpGs (hypervariable DVCs/hypermethylated aCpGs) becoming 'methylation hits' (Figure 3a). To test this idea, we applied a novel statistical algorithm called EVORA (Methods; Figure 3a), which aims to assess the risk of neoplastic transformation from the number of methylation outliers. Using multiple training/test set partitions, we found that EVORA could predict the future risk of CIN2+ in blind test sets (area under the curve (AUC) = 0.66 (0.58 to 0.75), P < 0.05; Figure 3b), while an analogous classifier based on differences in mean methylation levels could not (AUC = 0.51 (0.30 to 0.72), P = 0.46; Figure 3c).

Figure 3
figure 3

Epigenetic variable outliers for risk prediction analysis. (a) Flowchart describing the EVORA model. (i) Age-associated DNAm variation and age-independent differentially variable DNAm are both correlated with the risk of prospective neoplasia (CIN2+). aCpGs undergoing age-associated hypermethylation (hyperM) overlap strongly with differential variable CpGs (DVCs) that exhibit increased variance in future CIN2+ cases. This overlap defines, for a given training set, the pool of candidate risk CpGs. (ii) Multiple training/test set partitions in a ten-fold internal cross-validation on COPA-transformed (Methods) methylation profiles is used to optimize the COPA threshold and the set of risk CpGs. (iii) Risk prediction using EVORA: for an independent sample its risk score is estimated as the fraction of risk CpGs with a β-value larger than the optimal threshold, as evaluated in the COPA-basis. (b) EVORA receiver operating characteristic (ROC) curve, AUC and its 95% confidence interval in the ARTISTIC cohort (152 normal samples: 75 future CIN2+, 77 normals). (c) Comparison of C-index (AUC) values obtained using EVORA with a classification algorithm based on detecting differences in mean methylation levels (mean) in the ARTISTIC cohort. Boxplots are over 100 distinct training-test set partitions and P-values are from a Wilcoxon test detecting deviation from the expected null (C-index = 0.5) as well as between the two classification algorithms. (d) EVORA ROC curve in set 1 (48 liquid-based cytology samples: 18 CIN2+, 30 normals). (e) EVORA ROC curve in set 2 (63 cervical tissue samples: 48 cancers, 15 normals). In all ROC curves, AUC values and 95% confidence intervals shown. FPR: false positive rate; Se: sensitivity.

Risk CpGs identified in normal cells can detect intraepithelial neoplasia and cervical cancer

EVORA identified a total of 140 risk CpGs (hypervariable DVCs and hypermethylated aCpGs; Figure 4a; Additional file 11), of which 69 mapped to PCGTs. We postulated that this pool of 140 risk CpGs would also necessarily diagnose CIN2+ status, since for established neoplastic cells we would expect an even higher fraction of these CpGs to be hypermethylated. Indeed, in an independent Illumina Infinium 27 K methylation data set of normal cervical smears and age-matched CIN2+ samples (set 1; Methods), EVORA was able to predict CIN2+ status with very high accuracy (Figures 3d and 4b). Importantly, while the risk scores of the normal samples in ARTISTIC and set 1 were comparable to each other, the scores of the CIN2+ samples were significantly higher than those of normal cells that only become CIN2+ within 3 years (Figure 5), reflecting a progressive increase from normal cells at low risk, to normal cells at high risk, and finally to cells in a pre-invasive neoplastic state. Importantly, risk CpGs (that is, differentially variable CpGs) identified from the ARTISTIC cohort predicted CIN2+ status better than CpGs that were not differentially variable, even if they mapped to PCGTs (Figure 6). Since risk CpGs performed similarly irrespective of PCGT status (Figure 6), this indicates that differential variability is the key feature of cells at risk of morphological transformation and not PCGT status.

Figure 4
figure 4

Heatmaps over risk CpGs. (a-c) Heatmaps of COPA-transformed methylation values for the top 140 risk CpGs that are (i) significantly hypermethylated with age and (ii) show significant increased variability in future CIN2+ cases, as determined in the ARTISTIC cohort. Color codes for COPA scores: yellow = COPA score < 1 (no methylation); skyblue = COPA score < 5. Outliers denoted by blue = methylation COPA score > 5 and black = methylation COPA score > 10. CpGs have been hierarchically clustered using a Manhattan distance metric. Those mapping to PCGTs are labeled with their associated gene. Samples have been ordered according to their EVORA risk score as shown in the panels above heatmaps. (a) ARTISTIC cohort: 152 samples (75 prospective CIN2+ (red), 77 no CIN2+ at last follow-up (green). (b) Set 1: 48 samples (18 CIN2+ (red), 30 normals (green)). (c) Set 2: 63 cervical tissue samples (48 cancers (red), 15 normals (green)). (d) Heatmap depicts the same data matrix as in (c) but with the methylation values on the β-value scale where CpG β-values have been median normalized to zero. The corresponding scores now depict the percentage of methylation hits as measured on the beta-scale.

Figure 5
figure 5

Cross-comparison of EVORA risk scores. Boxplots of EVORA risk scores (y-axis) of the 77 normal LBC samples in ARTISTIC (N(ART)), the 30 normal LBC samples of set 1 (N(Set1)), the 75 prospective CIN2+ LBC samples in ARTISTIC (preCIN2+(ART)), and the 18 CIN2+ samples of set 1 (CIN2+ (Set1)). Wilcox-test P-values between N(ART) and N(Set1), and between preCIN2+(ART) and CIN2+(Set1) are given.

Figure 6
figure 6

EVORA AUC values as a function of differential variability and PCGT status. (a, b) Comparison of EVORA AUC measures for four different CpG classes in (a) set 1 consisting of 18 CIN2+ LBC samples and 30 normal LBC samples, and in (b) set 2 consisting of 48 cervical cancers and 15 cervical normal tissues. Of the 140 risk CpGs, 69 mapped to PCGTs (risk-PCGT class) and 71 did not (risk-nonPCGT class). In addition, we randomly selected 70 non-differentially variable, non-age-associated CpGs (nonrisk) that mapped and that did not map to PCGTs. The random selection was done 100 times and AUC values averaged. Also provided are 95% confidence intervals.

Next, we explored the methylation profiles of the 140 risk CpGs in an independent set of cervical cancers and corresponding age-matched normal tissue (set 2; Methods). The outlier risk scores were significantly reduced (Figure 4c) owing to the fact that risk CpGs exhibited much more of a bi-modal methylation pattern between cancer and normal tissue; that is, risk CpGs were invariably either hemi- or fully methylated across a high proportion of the cancers (Figure 4d). Nevertheless, a small subset of risk CpGs retained their outlier profiles in cervical cancer and EVORA was therefore still highly predictive of cancer status (Figure 4c). Adapting EVORA to the original beta methylation scale to better capture the observed bi-modal methylation profiles yielded a perfect discrimination of normal and cancer tissue (Figures 3e and 4d).

Risk genes are underexpressed in cervical cancer

To evaluate if the genes associated with the identified risk CpGs show expression changes in cervical cancer, we built an integrated data set of mRNA expression profiles over 13,213 genes and encompassing 42 normal cervical epithelial and 74 cervical cancer (squamous cell carcinoma) specimens (Methods). The integration was done using a renormalization procedure that we have validated previously [1214]. Of the 140 risk genes, 86 could be mapped to this merged data set. Comparison of average mRNA levels of our 86 risk genes between normal and cancer tissue confirmed that risk genes showed lower average expression levels in cervical cancer (Additional file 12). Analysis of individual gene statistics further showed that 46 were differentially expressed and that there was a significant skew towards underexpression in cancer (Additional file 12). Moreover, in only 13 of 1,000 random selections of 86 gene sets (Monte Carlo analysis described in Methods) did we observe a skew as significant as the one observed for the risk genes (Additional file 12).

Discussion

In this work we have addressed two fundamental questions. First, do DNAm changes precede the morphological signs of neoplastic transformation? And second, can these epigenetic changes (a) predict the risk of neoplastic development, and (b) allow detection of early (non-invasive) cancers? We have addressed these questions in the context of the uterine cervix, currently the only human organ allowing relatively easy access to the cell of origin of the associated cancer well in advance of the first morphological signs of neoplastic transformation.

Based on the hypothesis that epigenetic variability may reflect exposure to various genetic and non-genetic risk factors (including age), and that differential epigenetic variability may reflect underlying differential exposure to these factors [1, 3], we posited that differentially variable methylated CpGs (DVCs) might be potential risk indicators of neoplastic transformation. Our prediction algorithm based on epigenetic variance (EVORA; Figure 3A) could predict, with low but statistically significant accuracy (AUC = 0.66, P < 0.05), prospective CIN2+ status in blind test sets (Figure 3b). In contrast, an analogous classifier that uses differential methylation instead of differential variability statistics could not predict prospective CIN2+ status (Figure 3c). Using EVORA, the set of risk CpGs could also accurately identify independent CIN2+ and cervical cancer samples (AUC > 0.9; Figures 3d, e and 4). Using these additional data sets, we also showed that differential variability is a more important characteristic of risk CpGs (supporting the findings in [3]) than PCGT status (Figure 6). Thus, these results support the view that differential variability is a key characteristic of cells predisposed to morphological transformation.

We have obtained these results in the context of cervical cancer using a study design that was matched for age and HPV status. Since HPV infection is a well-known necessary causal factor for cervical cancer [26], it is thus reasonable to assume that those HPV-negative samples that later developed a CIN2+ will have done so only by acquiring an HPV infection after sample collection [27]. It is important, therefore, to point out that the increased DNAm variability at the identified loci only increases the risk of CIN2+ in the background of an existing HPV infection. Given that HPV infection is a necessary but not sufficient factor to cause cervical carcinogenesis (most HPV-infected women do not develop cervical cancer), it is indeed very likely that unknown 'tumor-suppressor' factors play a role in determining which HPV-infected women develop a CIN2+. Thus, our data support a model in which increased DNAm variability at the identified loci may compromise specific tumor suppressor functions, predisposing HPV-infected women to develop a CIN2+. In this context, it is also important to note that we did not observe HPV infection to have a major effect on DNAm patterns and the EVORA risk score itself was not correlated with HPV status (Additional file 13). Nevertheless, we did observe a very weak marginal association of DNAm with HPV status within only the prospective CIN2+ sample subgroup (Additional file 13). Thus, while increased DNAm variability may synergize with a background HPV infection to increase the risk of CIN2+, HPV infection itself only appears to have a minor effect on DNAm patterns in the earliest stages of cervical carcinogenesis. It is possible, however, that HPV infection may cause significant DNAm and mRNA changes during the latter stages of carcinogenesis, as reported, for instance, in the case of head and neck cancers [11, 28, 29].

In line with our hypothesis that epigenetic variable CpGs are involved in carcinogenesis, we find that they become more homogeneously methylated (that is, either hemi- or fully methylated) across cancers (Figure 4d), in stark contrast to their heterogeneous (that is, unmethylated versus hemi- or fully methylated) outlier profiles in potent CIN2+ and established CIN2+ cells (Figure 4a, b). This transition from heterogeneous outlier profiles in the earliest stages of cervical carcinogenesis to more homogeneous methylation profiles in the more advanced cancer stage points towards a dynamic process in which DNAm is initially acquired in a largely stochastic manner, as hypothesized in [1], but that later becomes more homogeneous and deterministic. However, we should point out that risk CpGs remain highly variably methylated across the invasive cancers, as previously reported in [3].

Interestingly, the heatmaps of the prospective and pre-invasive cancer series (Figures 4a, b) also revealed the existence of a striking bi-modality within the cases only, with a small subset of cases exhibiting a highly correlated pattern of CpG methylation, in contrast to the more stochastic pattern of methylation seen in the majority of samples. It is particularly interesting that the risk CpGs identified in ARTISTIC generated this same characteristic stochastic versus correlated bi-modal pattern in the CIN2+ samples of set 1. To investigate this further, we used an unsupervised clustering algorithm to divide up the CIN2+ samples (of set 1) and the cervical cancers (of set 2) each into two clusters according to the average level of methylation over all CpGs mapping to CpG islands excluding the 140 risk CpGs (Additional file 14). Interestingly, in both sets, the EVORA risk scores obtained from the 140 riskCpGs were significantly higher for the cases exhibiting the higher global levels of CpG island methylation (Additional file 14). Thus, the risk CpGs appear to correlate particularly strongly with CIN2+ samples and cervical cancers of overall higher CpG island methylation levels. It will be interesting, therefore, to explore the relationship, if any, of the identified risk CpGs to a potential CpG island methylator phenotype (CIMP) [30, 31] in cervical cancer. Using an unsupervised consensus clustering over the 1,000 most variable CpGs in the ARTISTIC set - that is, the unsupervised procedure used in [31, 32] to define a CIMP, did not yield a clear CIMP, nor did the clusters correlate with the EVORA risk scores, suggesting that there is no CIMP (in the sense defined in [31, 32]) present three years in advance of morphological transformation (Additional file 15).

Over the years extensive evidence has accumulated to demonstrate that PCGT methylation is crucially involved in carcinogenesis [3337]. Although we have here provided substantial evidence that differential variability is a more fundamental characteristic of transformation than PCGT status (Figure 6), our analyses nevertheless implicate a subset of PCGTs in the earliest stages of carcinogenesis (69 out of the 140 risk CpGs mapped to PCGTs). Numerous risk CpGs were also annotated to genes that have been previously demonstrated to be linked to cervical cancer, including CALCA [38], CXCR4 [39], DCC [40], HOXC10 [10], MYOD1 [41], SFRP4 [42], SOX1 [43], TWIST1 [44] and WT1 [45]. However, we note that only HOXC10 had its risk CpG site located within 200 bp upstream of the transcription start site. In fact, of the 140 risk CpG sites, only 64 were within 200 bp of the start site, and of these only 24 were upstream. Thus, even though we observed the genes associated with our risk CpGs to be frequently underexpressed in cervical cancer, it remains to be seen if methylation of the identified risk CpG sites plays a part in the underlying regulatory mechanism.

Our data also revealed a remarkably strong overlap between CpGs undergoing age-associated hypermethylation in normal tissue and CpGs that were hypervariable in prospective CIN2+ cases. PCGTs made a large component of this overlap and age-associated hypermethylated CpGs also correlated significantly with CIN2+ status, further strengthening our previous findings [7]. We stress that even those PCGT CpGs that were identified as undergoing age-associated hypermethylation in other tissues (for example, whole blood) [7] were also more hypervariable in prospective CIN2+ samples and could indicate the risk of neoplastic transformation. This strong overlap between hypermethylated aCpGs and hypervariable DVCs is consistent with a model in which differential exposure to risk factors accumulates with age, thus generating the observed stochastic epigenetic variability. That these methylation changes and overlaps are not due to changes in cell-type composition is supported by many studies [7, 4648]. We also observed here that changes in cell-type composition were more likely to be captured by age-independent variable CpGs (vCpGs), defined as CpGs that showed the largest (age-adjusted) variation across all normal samples. Specifically, these maximally varying vCpGs exhibited fairly large (> 80%) changes in methylation, did not overlap with DVCs, and accordingly were neither enriched for PCGTs nor discriminatory of prospective CIN2+ status (Additional file 9). In fact, vCpGs were enriched for Gene Ontology terms related to extracellular space and mesenchymal features, including also many cell differentiation markers (Additional file 16), and thus it is possible that methylation variation in these CpGs reflects variations in the epithelial to mesenchymal and stromal cell ratio. Since risk CpGs were not enriched for these stromal and mesenchymal features, it supports the view that their observed methylation changes reflect clonal heterogeneity within the epithelial cell population only.

The reliability of the methylation data generated from the ARTISTIC cohort is strongly supported by emerging biology and successful validations in two independent data sets. Nevertheless, to further check the reliability of the Infinium methylation data, we compared the β-values to those obtained using Methylight. Matched Infinium-Methylight data were available for SOX1 and WT1 in 48 LBC samples (set 1) [49]. We found statistically significant agreement between the two data types for both SOX1 and WT1 (Additional file 17).

Unfortunately, the limited coverage of the Infinium 27 K array (only 0.1% of the CpGs present in the human genome [50]) means that we could not fully explore the spatial methylation patterns around the identified risk CpG sites. As shown in Hansen et al. [3] in the context of cancer, differential variability is associated with increased spatial variability and loss of stability of the sharp methylation boundaries. Thus, it will be interesting to explore the spatial variability around the identified CpG risk sites with a more comprehensive and unbiased technology such as the Infinium 450 K methylation beadarray [51], as this may reveal a similar loss of methylation boundaries surrounding these sites.

Conclusions

We have demonstrated that variability in DNAm, which is associated with age and other factors, and which occurs well in advance (at least 3 years) of the first morphological neoplastic changes, is associated with the risk of neoplastic transformation. More generally, we have demonstrated that inter-individual epigenetic variance is an intrinsic characteristic of cells that become neoplastic.