Background

DNA methylation plays an important role in differentiation and development [1, 2]. Alterations in DNA methylation patterns are associated with various human diseases, including cancer and diabetes [3,4,5]. DNA methylation is the covalent addition of methyl groups to DNA. In mammals, methyl groups are most commonly added at a ‘CpG site’, which is a cytosine (C) base immediately followed by a guanine (G) base. There are ~ 28 million CpG sites in the human genome, and 70–80% are methylated in normal, healthy cells. The remaining unmethylated sites are found in clusters that are often 1000 bp long, termed CpG islands [6]. CpG islands are predominantly located at gene promoters; the associated genes are typically expressed and include almost all the housekeeping genes present in the human genome [7, 8]. In cancer, CpG island promoters are prone to hypermethylation and associated gene silencing. In contrast, the bulk of the genome in cancer is subject to hypomethylation and activation of cancer-associated oncogenes [9, 10].

To enable comprehensive methylation studies across large-scale clinical cohorts, whether to profile methylation changes as we age or to identify changes that occur with disease progression, we need cost-saving advances in methylation sequencing technologies. To date, several methods have been developed to perform DNA methylation analysis on a genome-wide scale. These can be divided into three broad categories: (1) enrichment-based methods using antibodies, restriction enzymes or immunoprecipitation, followed by sequencing [11,12,13,14], (2) bisulphite conversion followed by sequencing [1, 15] and (3) array-based methods, such as Illumina arrays [16,17,18,19]. Among these, bisulphite conversion followed by next-generation sequencing is regarded as the best approach to provide the complete human methylome [20,21,22]. Whole genome bisulphite sequencing (WGBS) maps cytosine methylation across the entire genome at single-base resolution. WGBS is currently the method of choice for studies that generate reference methylomes [23,24,25] and is increasingly used in basic and clinical research [26]. Although WGBS has been widely accepted as the gold standard method to assay genome-wide DNA methylation, the high cost [22] and depth of sequencing required [27] make it a challenge for large-scale DNA methylation studies. Since WGBS was developed in 2009, the number of large-scale WGBS studies has remained limited [28, 29]: either small numbers of samples are studied at high coverage [15, 30] or larger numbers of samples are studied at low coverage [31]. One of the earliest WGBS studies was performed at 30× coverage in order to compare the DNA methylation profiles of an embryonic stem cell and a fibroblast cell line [15]; 376 lanes of bisulphite sequencing on the Illumina GAII instrument were needed to achieve this coverage. Similarly, Kulis et al. [30] performed WGBS on sorted B cells at different stages of differentiation at a depth of 54× for 12 biological samples, using at least 85 lanes of the HiSeq 2000. Large-scale methylome studies may therefore be unaffordable for many researchers because of the high cost of running multiple lanes. The advent of the HiSeq X Ten has opened up the possibility of generating WGBS data with better coverage per lane of sequencing, making it potentially more cost-effective.

Despite the promising potential of performing methylome studies on the HiSeq X Ten, achieving optimum coverage of WGBS is challenging. The HiSeq X Ten has a fixed sequencing length of 300 cycles (150 bp paired end). Therefore, to achieve maximum output with higher diversity from this platform, the library needs to be at least 300 bp long, excluding the length of the adaptors on either side. Achieving this large library size for a WGBS library is difficult because DNA is fragmented during bisulphite conversion [32]. In addition, since bisulphite-converted libraries have an unbalanced base composition, it is a challenge to achieve an optimal proportion of clusters passing filter without balancing the library with another library of uniform base composition. Furthermore, the fast ExAmp amplification chemistry of the HiSeq X Ten can result in preferential amplification and cluster formation of libraries with smaller insert sizes. As a consequence, even a 5% contamination of adaptor dimers in the library can result in up to 60% of the sequencing output being adaptor sequence. Finally, a higher loading concentration of the bisulphite library can lead to polyclonal clusters, whereas a lower loading concentration can lead to more duplicate reads.
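The read-length arithmetic behind the 300 bp requirement can be illustrated with a minimal sketch. It assumes, for simplicity, that the two 150 bp reads of a pair start at opposite ends of the insert, so that any insert shorter than 300 bp makes the reads overlap (or run into adaptor) and part of the sequenced cycles adds no new template bases. The function names and the example insert sizes are illustrative only and are not part of the protocol.

```python
def unique_bases_per_pair(insert_bp: int, read_len: int = 150) -> int:
    """Template bases covered at least once by one paired-end read pair.

    Simplifying assumption: the two reads start at opposite ends of the
    insert, so together they can never cover more than the insert itself.
    """
    return min(insert_bp, 2 * read_len)


def useful_fraction(insert_bp: int, read_len: int = 150) -> float:
    """Fraction of the 2 x read_len sequenced cycles that adds new bases."""
    return unique_bases_per_pair(insert_bp, read_len) / (2 * read_len)


# Illustrative insert sizes (bp), not measurements from this study.
for insert in (150, 225, 300, 400):
    print(f"{insert} bp insert: {useful_fraction(insert):.0%} of cycles informative")
```

Under this simplified model, inserts of 300 bp or more use every cycle, whereas shorter inserts waste a proportionate share of the fixed 300 cycles.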

In this study, we optimised and developed a working protocol for the preparation and processing of WGBS data from good-quality DNA and from formalin-fixed paraffin-embedded tissue (FFPET) material to maximise data output from the HiSeq X Ten (Additional file 2: Fig. 5a, b). We provide guidelines on the best method to achieve a larger bisulphite library size and the optimum loading concentrations for the bisulphite library and the spike-in library on the HiSeq X Ten (HCS 3.3.76). We consistently achieve ~ 16–20× coverage per lane of WGBS data for good-quality DNA and ~ 10–14× per lane for FFPET. Finally, we explored the possibility of performing integrated WGS and WGBS from the same DNA samples in the same lane of the HiSeq X Ten platform and show that this results in minimal read wastage. Higher methylome coverage using the HiSeq X Ten platform now enables larger-scale population-based WGBS studies to be potentially more cost-effective.

Results

Comparison of library sequencing preparation methods using intact genomic DNA with the Illumina HiSeq 2500 platform

To optimise the coverage output per lane from WGBS, we first compared different library preparation methods on the HiSeq 2500 on a rapid run (two lanes) to determine differences in library fragment size distribution. We compared two pre-bisulphite (pre-BS) library preparation methods, where adaptor tagging and ligation are performed before bisulphite conversion, and three post-bisulphite (post-BS) methods, where adaptor tagging and ligation are performed after bisulphite conversion [33] (‘Methods’ section). The pre-BS library preparation methods used were KAPA LTP and KAPA Hyperprep, from KAPA Biosystems, with input amounts of 1000 and 100 ng DNA, respectively. The post-BS methods used were the TruSeq DNA methylation kit from Illumina, TruMethyl WG from Cambridge Epigenetix and Accel-NGS Methyl-Seq from Swift Biosciences, with input amounts of 50, 200 and 100 ng DNA, respectively. In addition to the original methods from the manufacturers, we also modified the AMPure XP bead ratios during the size selection step of the library preparation to determine whether a change in bead ratios had an effect on the library fragment size and the subsequent overall sequencing coverage across the genome. Details of the modifications are provided in the ‘Methods’ section. For consistency, all libraries were prepared from the same cell line (LNCaP) DNA sample.

Table 1 summarises the differences observed in percentage of duplicate reads, library fragment size distribution, genome-wide coverage and the ratio of bias in coverage across CpG islands and CpG shores (illustrated in Fig. 1a). Details of how the bias ratio was estimated are provided in the ‘Methods’ section. First, we found that the library preparation methods with altered size selections (< 300 and > 400 bp) gave rise to different library fragment size distributions (Table 1 and Fig. 1b) and associated differences in genomic coverage. For example, with the KAPA LTP kit, the < 300 bp size selection resulted in an average library fragment size of 175 bp and 8.8× coverage per lane of sequencing, whereas the > 400 bp size selection resulted in an average library fragment size of 237 bp and 17.2× coverage per lane of sequencing (Table 1). The trend for larger library fragment size to improve genomic coverage was observed across all the different library preparation methods (Fig. 1c). Second, we observed that, among the manufacturers’ recommended methods, both KAPA LTP and KAPA Hyperprep result in an under-representation of reads across CpG islands (Additional file 2: Fig. 1a, b), whereas the TruSeq DNA methylation method shows a bias towards CpG islands (Additional file 2: Fig. 1c). The TruMethyl WG method showed the least bias (0.8 CpG islands/1.1 CpG shores), with almost equal coverage across CpG islands, CpG shores and other regions of the genome (7.68×, 10.80× and 10.08×, respectively), closely followed by the Accel-NGS Methyl-Seq method (Additional file 2: Fig. 1d, e). Based on these combined results for coverage and bias, we decided to further optimise the TruMethyl WG library method for WGBS of intact DNA on the HiSeq X Ten platform.
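As a worked illustration of the bias ratio, the sketch below assumes that the ratio is the mean coverage over a feature (CpG islands or CpG shores) divided by the mean coverage over the remaining regions of the genome. This definition is an assumption made here for illustration, since the exact procedure is described in the ‘Methods’ section rather than in this text; it does, however, reproduce the 0.8/1.1 values reported for TruMethyl WG from the three coverage figures quoted above. The function name is hypothetical.

```python
def bias_ratio(feature_cov: float, background_cov: float) -> float:
    """Coverage of a feature relative to the rest of the genome.

    Assumed definition (not taken from the Methods): a value of 1.0 means
    the feature is covered as deeply as the background; values below 1.0
    indicate under-representation, values above 1.0 over-representation.
    """
    if background_cov <= 0:
        raise ValueError("background coverage must be positive")
    return feature_cov / background_cov


# Mean coverage reported in the text for TruMethyl WG (HiSeq 2500).
island_cov, shore_cov, other_cov = 7.68, 10.80, 10.08

island_ratio = round(bias_ratio(island_cov, other_cov), 1)
shore_ratio = round(bias_ratio(shore_cov, other_cov), 1)
print(island_ratio, shore_ratio)  # 0.8 1.1
```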

Table 1 Comparison of different library preparation kits on intact genomic DNA using HiSeq 2500 Platform
Fig. 1

Comparing different library preparation methods using genomic DNA on the HiSeq 2500. a Cluster plot of sequencing output metrics obtained from different library preparation methods from intact genomic DNA. b Bar graph showing fragment size distribution for different bead size selections for each library preparation method. c Plot with x-axis indicating coverage and y-axis indicating fragment size, illustrating how an increase in fragment size leads to improved coverage

Optimisation of library loading concentration and spike-in balanced libraries on the HiSeq X Ten

A challenge in performing WGBS on the HiSeq X Ten is the unbalanced base composition of the bisulphite library; the bisulphite-treated DNA therefore requires spiking with DNA of a normal base composition in order to maximise the clusters passing filter during sequencing. For whole genome sequencing (WGS), the optimal library loading concentration recommended by Illumina is 300 pM. We therefore first used 300 pM as the loading concentration for a WGBS library prepared from cell line (LNCaP) DNA (Table 2). To balance the nucleotide composition of the bisulphite library, we spiked in PhiX (from Illumina) at 25% of 250 pM and obtained a coverage of 8.7× from one lane of sequencing on the HiSeq X Ten (Table 2). To further improve the coverage, we lowered the loading concentration of the bisulphite library to 250 pM and performed sequencing with two different loading concentrations of the PhiX spike-in, namely 250 and 300 pM. This improved coverage from 8.70× to 15.19× using the same cell line input DNA (Table 2). To test which PhiX loading concentration gave the higher coverage, we compared the sequencing coverage obtained from six different DNA samples isolated from blood, three of which were spiked with 25% of 250 pM PhiX and the other three with 25% of 300 pM PhiX (Table 2, blood DNA samples). We observed that a loading concentration of 250 pM for the bisulphite library, combined with a 25% spike of 300 pM PhiX, gave a marked and consistent increase in overall coverage of the bisulphite genome (16.59–20.24×) (Fig. 2a).

Table 2 Comparison of coverage obtained with different bisulphite library and PhiX loading concentrations
Fig. 2

Optimisation of spike-in library and its loading concentration. a Bar plot showing the difference in coverage obtained from sequencing two different loading concentrations of the PhiX spike-in library for six WGBS libraries obtained from blood DNA samples. b Bar plot showing the coverage obtained from both the bisulphite library and genomic library when 25% of genomic library is spiked in instead of PhiX, for five different bisulphite libraries prepared by the TruMethyl WG method. c Box plot showing the coverage distribution of the same set of bisulphite libraries when spiked with either 25% of 250 pM genomic library or 25% of 300 pM of PhiX library

However, the problem with using PhiX DNA as the spike-in is that at least 25% of the sequencing reads are ‘wasted’. Therefore, to maximise the sequencing information, we tested the effect of using genomic DNA (cell line DNA) as the balanced DNA control spike. First, we spiked 25% of 250 pM of unconverted genomic library DNA (similar in size to the matching bisulphite libraries) into five independent libraries prepared from cell line DNA and DNA from blood (Table 3). We obtained a mean bisulphite genome coverage of ~ 15× per lane of the HiSeq X Ten and a mean whole genome coverage of ~ 8.4× per lane (Table 3; Fig. 2b). We next compared the coverage output between 25% of 300 pM PhiX and 25% of 250 pM genomic spike-in using the same set of 10 independent bisulphite libraries (blood DNA, cell line DNA) (Fig. 2c). Interestingly, we found that the spike-in giving the maximum bisulphite coverage output was 25% of 300 pM PhiX, which resulted in 15.5×–20.2× WGBS coverage (Table 4). Therefore, if only methylome data are required, our results suggest that the PhiX spike-in is preferable.

Table 3 Comparison of coverage from bisulphite and genomic library when 25% of genomic library is spiked
Table 4 Comparison of coverage and duplicate reads for bisulphite libraries when spiked with genomic library or PhiX

The potential for HiSeq X Ten to provide integrated WGS and WGBS on intact genomic DNA

To further capitalise on informative sequencing data per lane of sequencing, we tested the output reads from simultaneously sequencing the methylome and genome from the same DNA samples. We first tested spiking four independent bisulphite-converted libraries from clinical prostate cancer DNA samples (intact genomic DNA) with 250 pM of their matching genomic library in a 50:50 ratio. We performed the library preparation, spike-in and sequencing on the HiSeq X Ten in technical duplicates (Table 5a, b) and observed that the overall coverage per lane (for the WGBS and WGS libraries together) is ~ 26×, consisting of average bisulphite library coverage (~ 10–13×) and matching genomic sequence coverage (~ 13–16×) (Table 5; Fig. 3a). Using the Meth10X pipeline (see ‘Methods’ section), we could detect exemplary SNPs from the WGBS data at a coverage of 13× (Fig. 3b; Additional file 2: Fig. 2), which were confirmed in the WGS spike-in data of the same sample.

Table 5 Comparison of coverage from bisulphite and genomic library when sequenced in a 50:50 ratio on a single lane of HiSeq X Ten
Fig. 3

Integrating whole genome and whole genome bisulphite sequencing. a Bar plot depicting the coverage obtained when both a genomic library and its corresponding bisulphite library are sequenced on the same lane of the HiSeq X Ten, for four prostate cancer samples sequenced in duplicate (a, b). b A representative IGV plot showing a C to T SNP identified in both the WGS and WGBS data at approximately 13× coverage. c Bar plot indicating the percentage of SNPs from WGBS concordant in spike-in WGS at 13× and 26× coverage. d A representative Venn diagram for one prostate cancer sample (2a) showing the number of SNPs concordant and discordant at 13× and 26× coverage for both WGBS and spike-in WGS

Approximately 40–45% of SNPs called from the WGBS data (~ 10–13× coverage) were found to be concordant with the SNPs identified from the spike-in WGS data (~ 13–16× coverage) for each sample tested (Fig. 3c, d; Additional file 1: Table 1). At higher coverage (~ 20–26×), the proportion of SNPs commonly called between the WGBS and spike-in WGS data increased to 51–53% (Fig. 3c, d; Additional file 1: Table 1). To compare the fidelity of SNPs called from WGBS and WGS spike-in data, we first identified the overlap of variants called in two single lanes of HiSeq X Ten sequencing (each lane corresponding to 30× coverage) for each of the four clinical prostate cancer samples (see ‘Methods’ section) and termed this the WGS ‘Gold Standard’ SNP data (WGS-GS). We found that ~ 95–96% of the SNPs detected in the spike-in WGS data (Additional file 1: Table 2; Additional file 2: Fig. 3) and ~ 55–57% of the SNPs from the WGBS data (Additional file 1: Table 3; Additional file 2: Fig. 3) were concordant with the WGS-GS data, indicating a higher degree of false positives called in the WGBS data, as previously reported [34]. However, we did find a similar distribution of variant calls across all genomic features (Additional file 1: Table 4). We also found that WGBS detects a higher percentage of SNPs in regions with higher GC content (0.5–0.75), whereas WGS-GS detects a higher percentage of SNPs in regions with lower GC content (0.1–0.4) (Additional file 2: Fig. 4a). In addition, a higher percentage of SNPs was detected by WGS-GS near CpG sites with high DNA methylation levels (> 0.75) (Additional file 2: Fig. 4b), whereas WGBS detects a higher percentage of SNPs near CpG sites that are unmethylated or lowly methylated (< 0.5) (Additional file 2: Fig. 4b). Together, our analyses indicate that integrated sequencing of WGBS and WGS libraries provides an efficient and cost-effective method to explore combinatorial analyses of genetic and epigenetic variations on one common technology platform.
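The concordance percentages above can be read as the share of SNPs in a query call set (for example WGBS) that are also present in a reference call set (spike-in WGS or WGS-GS). A minimal sketch of that calculation follows. It assumes that two calls match when chromosome, position, reference allele and alternate allele are identical; the actual matching criteria are those of the pipeline described in the ‘Methods’ section, and the variants shown are invented toy data.

```python
from typing import Iterable, Tuple

# (chromosome, 1-based position, reference allele, alternate allele)
Variant = Tuple[str, int, str, str]


def concordance_pct(query: Iterable[Variant], reference: Iterable[Variant]) -> float:
    """Percentage of query variants that are also present in the reference set."""
    query_set, reference_set = set(query), set(reference)
    if not query_set:
        raise ValueError("query call set is empty")
    return 100.0 * len(query_set & reference_set) / len(query_set)


# Toy call sets for illustration only (not data from this study).
wgbs_calls = [("chr1", 1000, "C", "T"), ("chr1", 2500, "G", "A"),
              ("chr2", 300, "A", "G"), ("chr3", 42, "T", "C")]
wgs_calls = [("chr1", 1000, "C", "T"), ("chr2", 300, "A", "G"),
             ("chr5", 77, "G", "T")]

print(concordance_pct(wgbs_calls, wgs_calls))  # 50.0
```

Note that the measure is not symmetric: swapping query and reference changes the denominator, which is why the WGBS and spike-in WGS sets give different percentages against the same WGS-GS reference.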

Correlation between library duplicate reads, spike-ins comparing HiSeq 2500 and HiSeq X Ten sequencing platforms

In general, we also observed that the duplicate read percentages for bisulphite libraries on the HiSeq X Ten were much higher than those observed for bisulphite libraries on the HiSeq 2500 platform (Tables 1, 4). To investigate this further, we compared the duplicate reads obtained from bisulphite libraries of cell line DNA samples sequenced on both the HiSeq 2500 and the HiSeq X Ten. Interestingly, for the same library preparation, we consistently obtained more duplicate reads on the HiSeq X Ten than on the HiSeq 2500. For example, the duplicate read percentage for the bisulphite libraries (cell lines) was ~ 1.2–2.7% on the HiSeq 2500 and ~ 15–18% on the HiSeq X Ten (Table 6); despite this, the coverage was consistently higher using the HiSeq X Ten.

Table 6 Comparison of duplicate reads obtained for the same libraries sequenced on both HiSeq 2500 and HiSeq X Ten

Next, to determine whether the higher coverage contributes to the greater proportion of duplicate reads observed for bisulphite libraries on the HiSeq X Ten, we measured the distribution of duplicate reads after randomly downsampling the raw reads obtained from the HiSeq X Ten platform to approximately the same number of raw reads obtained from the HiSeq 2500. Random downsampling was performed approximately 100 times, and the distribution of duplicate reads was estimated for two DNA samples (Fig. 4a, b). We observed that the distribution of duplicate reads for the bisulphite libraries on the HiSeq X Ten was similar across the hundred simulations (Fig. 4a, b). For example, the percentage of duplicate reads in sample 1 ranged between 17.85 and 17.88% across simulations; the frequency of each value is shown on the y-axis (Fig. 4a). These results suggest that the high duplicate read rates observed on the HiSeq X Ten are a machine-generated artefact rather than an indication of library complexity.
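The downsampling experiment can be sketched as follows. This is a simplified illustration rather than the pipeline used in the study: each read is represented by a hashable key (for example its alignment coordinates), reads sharing a key are counted as duplicates, and the toy read set is synthetic. The function names, the seed and the key representation are all assumptions made for the example.

```python
import random
from typing import Hashable, List, Sequence


def duplicate_pct(reads: Sequence[Hashable]) -> float:
    """Percentage of reads whose key was already seen in the same set."""
    return 100.0 * (len(reads) - len(set(reads))) / len(reads)


def downsampled_duplicate_pcts(reads: Sequence[Hashable], target: int,
                               n_iter: int = 100, seed: int = 0) -> List[float]:
    """Duplicate percentage for n_iter random subsamples of `target` reads."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [duplicate_pct(rng.sample(list(reads), target)) for _ in range(n_iter)]


# Synthetic reads: 1000 reads over 800 distinct keys, i.e. 20% duplicates.
toy_reads = [("chr1", i % 800) for i in range(1000)]

pcts = downsampled_duplicate_pcts(toy_reads, target=500, n_iter=100)
print(f"full set: {duplicate_pct(toy_reads):.2f}%")
print(f"downsampled range: {min(pcts):.2f}-{max(pcts):.2f}%")
```

Repeating the subsampling many times, as in the study, gives a distribution of duplicate percentages rather than a single value, which is what Fig. 4a, b summarises.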

Fig. 4

Correlation between duplicate reads, spike-ins and the sequencing platforms. a, b Plots showing the frequency of distribution of duplicate reads for two cell line DNA samples during down sampling of the raw reads from HiSeq X Ten to the number of raw reads obtained from HiSeq 2500. c Box plot showing the difference in duplicate percentage when the same set of ten bisulphite libraries were spiked with 25% of 250 pM genomic library and 25% of 300 pM PhiX library

We also observed that the rate of duplicate reads for the same bisulphite library was higher when the library was sequenced with the 300 pM PhiX spike than with the 250 pM genomic spike (Fig. 4c; Table 4, for ten independent bisulphite libraries). Differences in duplicate read numbers for the same bisulphite library preparation with different spike-ins suggest that the duplicate reads are not due to PCR amplification bias inherent to the library. Instead, we surmise that they are due to the way the clusters are spatially distributed in the nanowells of the HiSeq X Ten’s patterned flow cells. This spatial distribution of the clusters could depend on the nature of the spike-in DNA library loaded, its concentration and its insert size.

Comparison of overall coverage between HiSeq 2500 and HiSeq X Ten

Next, we compared the overall genome coverage and individual CpG site coverage obtained from WGBS of four clinical samples prepared by the TruMethyl WG method, on the HiSeq 2500 in High Output (HO) mode and on the HiSeq X Ten (Fig. 5a, b). The bisulphite libraries were all spiked with 25% of 300 pM PhiX for WGBS on the HiSeq X Ten and gave an overall coverage of ~ 16–20× per lane, compared to ~ 8× coverage per lane on the HiSeq 2500. Figure 5a summarises the whole genome coverage for the four samples sequenced individually on single lanes of the HiSeq X Ten and the merged coverage obtained from multiplexing the four samples on one lane of the HiSeq 2500 in HO mode. We found that almost 75% of the genome is covered at a depth of 10× from single-lane sequencing on the HiSeq X Ten, while only 40% of the genome is covered at a depth of ~ 10× from one lane of sequencing on the HiSeq 2500 (Fig. 5a). In addition, we assessed the coverage at individual CpG sites and found that only 30% of CpG sites were covered at a depth of ~ 10× from the HiSeq 2500 sequencing, whereas ~ 70% of CpG sites were covered at a depth of ~ 10× from the HiSeq X Ten sequencing (Fig. 5b). We also compared the coverage at specific genomic regions, including exons, intergenic regions, introns, promoters and repeat regions, between the HiSeq 2500 and HiSeq X Ten platforms for a clinical sample and a cell line (Fig. 5c, d; Additional file 2: Fig. 5a, b). We find that the HiSeq 2500 platform results in coverage ranging from 2.7× to 5.0× per lane across the different genomic regions, whereas the HiSeq X Ten platform results in coverage ranging from ~ 15× to 20× per lane across these same regions (Fig. 5c, d; Additional file 2: Fig. 5a, b).
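The statement that a given percentage of the genome (or of CpG sites) is covered at a depth of 10× corresponds to a cumulative coverage calculation. The sketch below shows one way to compute it from a depth histogram, that is, a mapping from sequencing depth to the number of positions observed at that depth. It assumes that "covered at 10×" means a depth of at least 10; the histogram values are invented for illustration and the function name is hypothetical.

```python
from typing import Mapping


def fraction_at_depth(depth_hist: Mapping[int, int], min_depth: int) -> float:
    """Fraction of positions whose depth is at least `min_depth`.

    depth_hist maps a sequencing depth to the number of positions
    (bases or CpG sites) observed at exactly that depth.
    """
    total = sum(depth_hist.values())
    if total == 0:
        raise ValueError("depth histogram is empty")
    covered = sum(n for depth, n in depth_hist.items() if depth >= min_depth)
    return covered / total


# Toy histogram (depth -> number of positions), not data from this study.
toy_hist = {0: 5, 5: 20, 10: 40, 20: 30, 30: 5}

for threshold in (1, 10, 20):
    print(f">= {threshold}x: {fraction_at_depth(toy_hist, threshold):.0%}")
```

Evaluating the function over a range of thresholds yields the kind of cumulative curve shown in Fig. 5a, b.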

Fig. 5

Coverage comparison between HiSeq 2500 and HiSeq X Ten. a Plot shows the fraction of genome covered at different depths for four samples sequenced together on one lane of the HiSeq 2500 versus when each of the samples is sequenced on a single lane of the HiSeq X Ten. The coverage plot for the HiSeq 2500 HO mode is the merged coverage obtained from multiplexing the four samples. b Plot shows the fraction of CpG sites covered at different depths when four clinical samples are sequenced together on one lane of the HiSeq 2500 versus when each of the samples is sequenced on a single lane of the HiSeq X Ten. c, d Box plot showing the coverage distribution across exons, intergenic regions, introns, promoter regions and repeat regions of the genome for a sample sequenced on one lane of HiSeq X Ten (c) and HiSeq 2500 (d)

Comparison of library preparation methods on FFPET DNA on the HiSeq 2500 Platform

Archival FFPET DNA is a valuable resource in cancer research to explore methylation alterations in cancer samples retrospectively; however, the DNA is generally degraded and genome-wide methylation analysis can be a challenge. To determine whether FFPET DNA can be used for WGBS on the HiSeq X Ten platform, we first compared the two pre-BS and the three post-BS methods on the HiSeq 2500 platform, using FFPET DNA isolated from prostate cancer biopsies (Table 7). Since FFPET DNA is commonly degraded (< 300 bp), a comparison of the library preparation methods with varying size selection was not feasible. We found that for FFPET DNA the different library methods performed fairly similarly and gave lower coverage than for intact DNA, ranging from 3.9× (KAPA Hyperprep) to 6.6× (Accel-NGS Methyl-Seq) from two lanes (one rapid run on the HiSeq 2500) (Table 7; Fig. 6). Overall, with regard to input amount, fragment size and coverage, the Accel-NGS Methyl-Seq method performed the best, using 100 ng of DNA and giving a library fragment size of 132 bp (Fig. 6). However, with regard to the bias ratio for representation of CpG islands and CpG shores, we found that the Accel-NGS Methyl-Seq method under-represented CpG islands (ratio of 0.6:0.9), whereas the TruMethyl WG method showed good coverage across these CpG-rich features (ratio of 1.1:1) (Table 7). We therefore decided to test both the TruMethyl WG and Accel-NGS Methyl-Seq methods for WGBS performance on the HiSeq X Ten.

Table 7 Comparison of different library preparation kits on FFPET using HiSeq 2500 Platform
Fig. 6

Comparing different library preparation methods using FFPET DNA. Cluster plot of sequencing output metrics obtained from different library preparation methods from FFPET DNA

WGBS of FFPET DNA on the HiSeq X Ten

We first compared WGBS coverage and bias outputs on the HiSeq X Ten from FFPET DNA using the TruMethyl WG and Accel-NGS Methyl-Seq methods, with 25% of 300 pM PhiX as the spike-in. We observed that the Accel-NGS Methyl-Seq method gave higher coverage per lane (~ 13.05–13.97×) and fewer duplicate reads than the TruMethyl WG method (10.84–11.16×) for the same FFPET DNA (Table 8). However, as we found for intact DNA, the representation of CpG islands and CpG shores is less biased using the TruMethyl WG method, for example, 0.4:0.9 for Accel-NGS Methyl-Seq versus 0.8:1.1 for TruMethyl WG (Table 8). We further confirmed the apparent bias by determining the average coverage across CpG islands, CpG shores and other regions of the genome for WGBS data obtained from both the TruMethyl WG method and the Accel-NGS Methyl-Seq method (Fig. 7a, b). We then investigated candidate CpG islands and compared the reads spanning these CpG islands between the TruMethyl WG method and the Accel-NGS Methyl-Seq method (Fig. 7c; Additional file 2: Fig. 6a, b). We also computed the CpG coverage distribution across exons, intergenic regions, introns, promoters and repeat regions of the genome (Fig. 7d, e). Our results confirmed that the representation of reads across specific genomic features is less biased in the TruMethyl WG method, whereas the Accel-NGS Methyl-Seq method represents a higher fraction of repeat regions than of the other genomic features (Fig. 7d, e).

Table 8 Comparison of coverage output of FFPET on the HiSeq X Ten using two library preparation kits
Fig. 7

Difference in HiSeq X Ten coverage distribution for FFPET bisulphite library prepared from two methods. a, b Box plot showing the difference in coverage across CpG islands, CpG shores and other regions of the genome for TruMethyl WG (a) and Accel-NGS Methyl-Seq (b) methods, when sequenced on the HiSeq X Ten. c IGV plot showing the difference in distribution of reads for a FFPET library obtained from the TruMethyl WG method and Accel-NGS Methyl-Seq method across a CpG island. d, e Box plots showing the coverage distribution across exons, intergenic regions, introns, promoter regions and repeat regions of the genome for a FFPET library prepared by the TruMethyl WG (d) and Accel-NGS Methyl-Seq (e) methods and sequenced on one lane of HiSeq X Ten

Comparison of overall methylation correlation for intact genomic DNA and FFPET between HiSeq 2500 and HiSeq X Ten

We observed that, despite differences in the chemistry between the two Illumina sequencing platforms and in the resulting genome-wide coverage per lane of sequencing, there is a remarkable correlation between the methylation calls obtained from the HiSeq 2500 and the HiSeq X Ten, as demonstrated using DNA from a cell line, a clinical sample and an FFPET sample (Pearson r > 0.94) (Fig. 8a). We next classified CpG sites into four bins based on their methylation level percentage, namely under-methylated (0–20%), low methylation (20–50%), intermediate methylation (50–80%) and high methylation (80–100%), and determined the percentage of CpG sites in the different methylation bins for the two platforms (Fig. 8b). Again, we found high concordance in methylation levels between the HiSeq 2500 and the HiSeq X Ten platforms. To further test the agreement between the methylation data across the different bins of CpG methylation percentage, we used the kappa statistic [35]. The average kappa values for six sample pairs (two cell lines, two clinical samples and two FFPET samples) compared between the two platforms are > 0.75 for the cell lines, ~ 0.75 for the clinical samples and 0.6 for FFPET (Fig. 8c). These values indicate that there is minimal difference in the methylation calls obtained from the two platforms: 0.21% of CpG sites were discordant in cell line DNA, 0.17% in the clinical samples and 0.49% in the FFPET sample. We found that these discordant sites were distributed across genomic features, including promoter, exon, intron, intergenic and repeat regions, for all three sample types (Fig. 8d). The higher rate of discordance identified for FFPET DNA could relate to the smaller fragment size relative to the intact DNA samples.
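The binning and agreement analysis can be sketched as follows. The example assumes an unweighted Cohen's kappa and half-open bins (a site at exactly 20%, 50% or 80% falls into the higher bin); both choices are assumptions made for illustration, since the text does not specify the kappa variant or the handling of bin boundaries. The methylation values are invented toy data and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence

BIN_EDGES = (20.0, 50.0, 80.0)  # upper limits of the first three bins, in %


def methylation_bin(pct: float) -> int:
    """Map a methylation percentage to bin 0 (0-20), 1 (20-50), 2 (50-80) or 3 (80-100)."""
    for index, edge in enumerate(BIN_EDGES):
        if pct < edge:
            return index
    return len(BIN_EDGES)


def cohen_kappa(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both raters use a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy methylation percentages for the same eight CpG sites on two platforms.
platform_a = [5, 10, 30, 60, 90, 95, 85, 40]
platform_b = [8, 25, 35, 55, 92, 70, 88, 45]

bins_a = [methylation_bin(p) for p in platform_a]
bins_b = [methylation_bin(p) for p in platform_b]
print(round(cohen_kappa(bins_a, bins_b), 3))  # 0.667
```

Kappa corrects the raw agreement (here 6 of 8 sites) for the agreement expected by chance given how often each bin is used on each platform, which is why it is lower than the simple proportion of matching sites.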

Fig. 8

Comparison of methylation correlation between HiSeq 2500 and HiSeq X Ten. a Correlation plots of methylation levels obtained from a cell line, a clinical sample and an FFPET sample sequenced on the HiSeq 2500 versus HiSeq X Ten (Pearson r > 0.94). b Correlation of methylation values obtained from HiSeq 2500 and HiSeq X Ten for a cell line, a clinical sample and an FFPET sample after grouping them into four bins of methylation percentages. c Average kappa values for six sample pairs, including two cell lines, two clinical samples and two FFPET samples, compared between the HiSeq 2500 and HiSeq X Ten platforms. d Bar plot showing the distribution of percentage of discordant sites across the genome for a cell line, a clinical sample and an FFPET sample

Discussion

Over the past decade, large-scale genome-wide methylation studies have been performed primarily with the Illumina 450K methylation array platform, RRBS technology or targeted approaches [36,37,38]. Although these profiling methods are relatively cost-effective, they can assess less than 5% of the CpG sites across the genome and are primarily biased towards CpG islands [39]. Therefore, there is a strong need to develop an economical method to allow genome-wide methylation analysis. Currently, WGBS provides the gold standard for methylome analysis at single-base resolution. However, the high sequencing cost, the considerable technical expertise required and the associated bioinformatic challenges in processing the data have limited the widespread application of WGBS. The advent of the HiSeq X Ten has opened up the possibility of performing WGBS with greater coverage per lane at a relatively lower cost.

Here we compare the most commonly available library preparation methods and identify the most efficient method to prepare WGBS libraries to achieve the greatest sequencing coverage on the HiSeq X Ten (HCS 3.3.76). We provide strategic guidelines that routinely produce ~ 16–20× coverage per lane of sequencing for intact human DNA and ~ 10–13× coverage per lane from FFPET samples. We find that coverage is primarily influenced by library fragment size and by the nature and amount of the spike-in DNA. First, we show that the choice of library method can influence coverage through the library fragment size that is obtained, as increasing fragment size results in higher sequencing coverage. Even greater coverage would be expected if 100 bp PE reads could be supported on the HiSeq X Ten, since the current requirement to use 150 bp PE reads means there is still considerable read wastage due to the smaller fragments generated after bisulphite treatment.

Second, we find that adding insufficient amounts of the spike-in library can lead to a poor proportion of clusters passing filter for the bisulphite library and therefore to lower coverage. We identify an optimum amount of spike-in library, for both PhiX and genomic DNA, to achieve the best coverage for human bisulphite libraries. Even though PhiX resulted in more WGBS coverage per lane of sequencing, spiking in genomic DNA from the same sample has the advantage of allowing SNP variants to be identified alongside DNA methylation calling, and therefore reduces read wastage.

Finally, our study compares the coverage obtained for a human bisulphite library from one lane of HiSeq 2500 sequencing and one lane of HiSeq X Ten sequencing. We show that there is high concordance in the methylation levels obtained from the two platforms. However, only 30% of the CpG sites are covered at a depth of 10× on the HiSeq 2500 platform, whereas almost 70% of the CpG sites are covered at a depth of 10× on the HiSeq X Ten platform.

A summary of our workflow for achieving optimal coverage on the HiSeq X Ten for intact genomic DNA and FFPET DNA is shown in Additional file 2: Fig. 7a, b. This workflow can potentially also be transferred to the newer software version HCS 3.4.0.38 and to the newer Illumina sequencing platform, NovaSeq, which uses a patterned flow cell similar to that of the HiSeq X Ten. The use of WGBS in large-scale epigenome-wide mapping studies is on the rise, and different technologies will appear. In fact, a study recently published on bioRxiv [40] has devised a completely new library preparation strategy involving tagmentation and bisulphite tagging, together with a sequencing approach using custom sequencing oligos, to achieve high-coverage WGBS on the HiSeq X Ten. Nevertheless, our results and observations provide an established protocol for generating good quality, high-coverage WGBS data at a reasonable cost and, in combination with WGS, herald a new era of integrated genomic and methylation sequencing studies.

Conclusions

In this study, we provide a systematic, efficient and complete approach to perform and analyse WGBS on the HiSeq X Ten. Our protocol allows for large-scale WGBS studies at reasonable processing time and cost on the HiSeq X Ten platform.

Methods

Cell lines

The prostate cancer cell line LNCaP and the breast cancer cell line MCF7 were obtained from the American Type Culture Collection (ATCC). The endocrine-resistant, MCF7-derived tamoxifen-resistant (TAMR) cell line was generated by the long-term culture of MCF7 cells in phenol red-free RPMI medium with 5% charcoal-stripped FCS and 4-OH-tamoxifen [41]. The B80 cell lines used in Tables 3 and 6 are human mammary epithelial cells immortalised by simian virus 40 T-antigen [42]. All cell lines were cultured under recommended conditions at 37 °C and 5% CO2. Sample 27, from Table 4, is the THP-1 cell line, and sample 28 is THP-1 treated with PMA. Human THP-1 cells were maintained in RPMI media supplemented with 10% (v/v) FCS, 0.05 mM 2-mercaptoethanol, 0.1 mg/ml penicillin/streptomycin and 2 mM L-glutamine. Differentiation of THP-1 into macrophages was performed by culturing the cells with 100 ng/ml phorbol-12-myristate 13-acetate (PMA) and 50 μM 2-mercaptoethanol for 48 h.

DNA samples

Genomic DNA from cell lines was extracted using the QIAamp DNA Mini kit (Qiagen, USA). DNA from blood samples and clinical prostate cancer samples was also extracted using the QIAamp DNA Mini kit (Qiagen, USA).

Library preparation methods

Pre-BS library preparation method

The two pre-BS library preparation methods were the KAPA LTP library preparation method and the KAPA Hyper Prep library preparation method, both performed following the manufacturer’s protocol except where noted. With the KAPA LTP method, a dual size selection at ratios of 0.5:1.0 followed by 0.7:1.0 was performed to obtain a library size of approximately 400 bp. For a library size of 300 bp, we followed the manufacturer’s protocol, in which the recommended size selection ratios are 0.6:1 followed by 0.8:1. To achieve library fragment sizes larger than 300 and 400 bp, fragments that were not bound to the beads were eluted out. With the KAPA Hyper Prep method, the manufacturer’s protocol was followed except at the post-PCR clean-up size selection step, where two different AMPure XP bead ratios, 0.75:1 and 0.85:1, were used to achieve library sizes of above 300 bp and above 200 bp, respectively. In both methods, the bisulphite conversion was performed using the EZ DNA Methylation Gold kit from Zymo Research.

Post-BS library preparation methods

The three post-BS library preparation methods used were the TruSeq DNA methylation kit from Illumina, the Accel-NGS Methyl-seq DNA library preparation kit from Swift Biosciences and the TruMethyl WG method from Cambridge Epigenetix (CEGX). Library preparation using the TruSeq DNA methylation kit was performed exactly according to the manufacturer’s protocol. With the Accel-NGS Methyl-seq DNA library prep method, in addition to following the original protocol, a size selection at a ratio of 0.85:1 was performed during the post-PCR SPRI clean-up step to achieve a library size greater than 300 bp. For the TruMethyl WG method, library preparation and indexing were carried out as described in the CEGX TruMethyl WG user guide v2. Since the presence of even 5% of adaptor dimers in the library leads to a 60% contamination of adaptor dimer reads on the HiSeq X Ten, we further improved the protocol by adding a clean-up step at the end using AMPure XP beads at a ratio of 50:50. For both the TruSeq DNA methylation kit and the Accel-NGS Methyl-seq kit, bisulphite conversion was performed using the EZ DNA Methylation Gold kit from Zymo Research. The TruMethyl WG method has its own bisulphite conversion process incorporated in the protocol.

Library QC and quantification

All libraries were quantified using the Qubit and KAPA library quantification kit (KAPA Biosystems), and the library quality was assessed using the High-sensitivity DNA kit on the Agilent 2100 Bioanalyzer (Agilent, CA, USA). Paired-end 150 bp sequencing was performed for each library on the Illumina HiSeq 2500 and the HiSeq X Ten platform.

CpG islands and CpG shores bias analysis

The genomic coordinates of CpG islands were obtained from http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cpgIslandExt.txt.gz, and CpG shores are defined as the regions immediately flanking CpG islands, extending up to 2 kbp on either side of each island. CpG others are all hg19 regions that are in neither CpG islands nor CpG shores. To compute the CpG coverage distribution of CpG islands, bedtools [44] was used to intersect the genomic coordinates of CpG islands with the coverage data of all hg19 CpG sites (~ 28 million CpGs). The same procedure was applied to CpG shores and CpG others. Finally, the coverage for the three classes was plotted as box plots using ggplot2 in R. With regard to the bias, a value < 0.75 indicates under-representation and a value > 1.2 indicates over-representation of reads across CpG islands and CpG shores.
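The shore definition and the bias thresholds above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the function names and the "balanced" label are hypothetical, intervals are assumed to be 0-based half-open as in BED files, and the sketch does not trim shores that overlap a neighbouring island or run past a chromosome end. How the bias value itself is computed is not specified here, so it is taken as an input.

```python
from typing import List, Tuple

# (chrom, start, end), 0-based half-open coordinates as in BED files
Interval = Tuple[str, int, int]

SHORE_WIDTH = 2000  # 2 kbp on each side of an island, as stated in the text


def cpg_shores(islands: List[Interval], width: int = SHORE_WIDTH) -> List[Interval]:
    """Return the regions flanking each island on both sides.

    Simplification (assumption): flanks are not trimmed against other
    islands or against chromosome ends other than position 0.
    """
    shores = []
    for chrom, start, end in islands:
        left_start = max(0, start - width)
        if left_start < start:  # skip an empty left flank at position 0
            shores.append((chrom, left_start, start))
        shores.append((chrom, end, end + width))
    return shores


def classify_bias(value: float) -> str:
    """Apply the thresholds from the text: < 0.75 under, > 1.2 over."""
    if value < 0.75:
        return "under-represented"
    if value > 1.2:
        return "over-represented"
    return "balanced"  # hypothetical label for values in between


if __name__ == "__main__":
    print(cpg_shores([("chr1", 5000, 6000)]))
    print(classify_bias(0.6), classify_bias(1.0), classify_bias(1.5))
```

In practice the same flanks could be produced with bedtools, followed by a subtraction of the islands themselves so that the three classes remain disjoint.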

Exons, intron, promoter, repeats and intergenic coverage analysis

Annotation of known gene transcripts and repeat elements was obtained from UCSC (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/knownGene.txt.gz, http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/rmsk.txt.gz). Genomic coordinates of repeat elements were obtained from the rmsk.txt.gz file. Promoters were defined as the non-repeat bases from 1500 bp upstream to 500 bp downstream of a known transcription start site (TSS) in knownGene.txt.gz, with overlapping promoter regions counted only once. Exons were defined as the non-repeat bases of exons in knownGene.txt.gz, with overlapping exons counted only once. Introns were defined as the non-repeat bases flanked by two exons of a single transcript, with overlapping introns counted only once. Finally, intergenic regions were identified as the remaining bases in the reference genome. To compute the CpG coverage distribution of exons, introns, promoter regions, repeats and intergenic regions, bedtools was used to intersect the relevant genomic coordinates with the coverage data of all hg19 CpG sites. Finally, the coverage for all five classes was plotted as box plots using ggplot2 in R.
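The promoter definition above can be illustrated with a short Python sketch. It is written under stated assumptions rather than taken from the study: the function names are hypothetical, transcript coordinates are assumed to be 0-based half-open as in the UCSC knownGene table, the TSS of a minus-strand transcript is taken as the transcript end, and the removal of repeat bases is left out.

```python
from typing import List, Tuple


def promoter_window(tx_start: int, tx_end: int, strand: str,
                    upstream: int = 1500, downstream: int = 500) -> Tuple[int, int]:
    """Return (start, end) of the promoter around the TSS.

    Upstream and downstream are relative to the direction of
    transcription, so the window is mirrored on the minus strand.
    """
    if strand == "+":
        tss = tx_start
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        tss = tx_end
        start, end = tss - downstream, tss + upstream
    else:
        raise ValueError(f"unknown strand: {strand!r}")
    return max(0, start), end  # clip at the chromosome start


def merge_intervals(intervals: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping or touching intervals so each base is counted once."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    windows = [promoter_window(10000, 20000, "+"),
               promoter_window(10500, 30000, "+")]
    print(merge_intervals(windows))
```

Merging is one way to read "not overlapping with itself": promoters of transcripts that share or nearly share a TSS collapse into a single region on a given chromosome.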

Estimating read duplication rate for WGBS data of lower coverage

Nearly 50% of reads were randomly withdrawn from the bam files of HiSeq X Ten WGBS data of approximately 20× coverage using samtools [43], and Picard tools 2.3.0 (http://broadinstitute.github.io/picard) was used to measure the read duplication rate of the downsampled bam file. This procedure was repeated 100 times to estimate the read duplication rate for low-coverage WGBS data (5–10×).
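The resampling logic can be illustrated on toy data in Python. This sketch is not a substitute for samtools and Picard: the function names are hypothetical, each read is reduced to a hashable key such as (chromosome, position, strand), and a duplicate is simply a repeated key, which is much cruder than Picard's paired-end duplicate definition.

```python
import random
from statistics import mean
from typing import Hashable, Sequence


def duplication_rate(reads: Sequence[Hashable]) -> float:
    """Fraction of reads that repeat an already-seen key."""
    if not reads:
        return 0.0
    return 1.0 - len(set(reads)) / len(reads)


def downsampled_duplication(reads: Sequence[Hashable], fraction: float = 0.5,
                            repeats: int = 100, seed: int = 0) -> float:
    """Mean duplication rate over repeated random downsampling.

    Each read is kept independently with probability `fraction`,
    so roughly that share of reads survives each round.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    rates = []
    for _ in range(repeats):
        kept = [read for read in reads if rng.random() < fraction]
        rates.append(duplication_rate(kept))
    return mean(rates)


if __name__ == "__main__":
    # Toy library: 1000 positions, each sequenced exactly twice.
    toy_reads = [("chr1", pos, "+") for pos in range(1000)] * 2
    print(duplication_rate(toy_reads))
    print(downsampled_duplication(toy_reads))
```

The toy example shows why the estimate has to be made on downsampled data: the duplication rate falls as coverage falls, so the rate measured at 20× would overstate the rate of a 5–10× library.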

Evaluation of DNA methylation agreement between different platforms

CpG sites covered at least 15× in both the HiSeq 2500 and the HiSeq X Ten WGBS data were used to compute the Pearson correlation. These paired CpG sites were then plotted with smoothScatter in R. To further evaluate the DNA methylation agreement between the platforms, we binned the DNA methylation values into four bins of (0–20%), (20–50%), (50–80%) and (80–100%), representing no, low, intermediate and high DNA methylation, respectively. These bins were then plotted as a heatmap using gplots [44]. Finally, the Kappa statistic was used as a measure of agreement between the two platforms [35], computed with the function kappa2 (unweighted) in the irr package [47]. Kappa values range from − 1 to 1, with values from + 0.0 to 0.20 indicating slight agreement, + 0.21–0.40 indicating fair agreement, + 0.41–0.60 indicating moderate agreement, + 0.61–0.80 indicating substantial agreement and + 0.81–1.0 indicating perfect agreement.
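The binning and the unweighted Kappa statistic can be sketched in Python as follows. This is an illustration only: the study used kappa2 from an R package, the function names here are hypothetical, and the bin boundaries are assumed to be half-open ([0, 20), [20, 50), [50, 80), [80, 100]), which the text does not specify.

```python
from collections import Counter
from typing import Sequence

BIN_EDGES = (20.0, 50.0, 80.0)                      # upper edges in percent
BIN_LABELS = ("none", "low", "intermediate", "high")


def methylation_bin(percent: float) -> str:
    """Map a methylation level in percent to one of the four bins."""
    if not 0.0 <= percent <= 100.0:
        raise ValueError("methylation must be between 0 and 100")
    for edge, label in zip(BIN_EDGES, BIN_LABELS):
        if percent < edge:
            return label
    return BIN_LABELS[-1]


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both platforms use a single identical category
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    hiseq_2500 = [5.0, 35.0, 60.0, 95.0, 85.0]   # toy values, not real data
    hiseq_x = [10.0, 30.0, 75.0, 90.0, 70.0]
    labels_a = [methylation_bin(v) for v in hiseq_2500]
    labels_b = [methylation_bin(v) for v in hiseq_x]
    print(cohen_kappa(labels_a, labels_b))
```

Kappa corrects the raw proportion of matching bins for the agreement expected by chance, which matters here because most CpG sites fall into the high-methylation bin.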

HiSeq X Ten WGBS processing pipeline

The increased coverage achieved on the HiSeq X Ten generates > 1 Tb of data per run. To process the data, we developed a new bioinformatics pipeline package, Meth10X, based on the previously published P3BSseq package [45], to support the increased number of bisulphite reads and to reduce the processing time significantly. The Meth10X pipeline takes raw bisulphite reads in fastq files and trims the adapters following the guidelines of the library preparation kit. The trimmed fastq files are aligned to the reference genome with bwa-meth [46]. Duplicates are marked in the resulting bam files, which are merged if necessary. The duplication rate, coverage bias (across genomic features) and methylation bias in reads are estimated to provide quality control. We also use Qualimap 2 [47] for further evaluation of the alignment, including the percentage of mapped and unmapped reads, mapping quality distribution, GC content distribution, insert size distribution and coverage distribution. The Meth10X pipeline can be accessed from GitHub, https://github.com/luuloi/Meth10X. For DNA methylation calling, MethylDackel (https://github.com/dpryan79/MethylDackel) is used with three different cytosine context patterns (CG, CHH, CHG) and strand specificity. MethylDackel also provides options to remove methylation bias by trimming biased read positions before calling methylation levels.

Estimating concordant SNPs from WGBS and WGS

For calling SNPs from WGBS, we incorporated Biscuit (https://github.com/zwdzwd/biscuit) within the Meth10X pipeline. For calling SNPs from the spike-in WGS 50:50 mix data and WGS-GS, we used GenomeAnalysisTK [48] following GATK best practices for variant calling. All the SNP sets obtained were then filtered to retain SNPs with at least 5× coverage and a QUAL score of at least 200.0 for spike-in WGS data and WGS-GS, and at least 20.0 for WGBS data. The SNP concordance and discordance between pairs (spike-in WGS, WGS-GS and WGBS) were evaluated on the filtered vcf files using the hap.py package, a haplotype VCF comparison tool (https://github.com/Illumina/hap.py). To assess the GC content across the genome for both WGBS and WGS-GS data, we used bedtools and the human genome hg19 to compute GC content in 100 bp sliding windows, referred to as the GC content track. We then overlapped the vcf files of SNPs from WGBS and WGS-GS with the GC content track to obtain a GC content score for every SNP. Following this, the density distribution of the GC content scores of WGS-GS and WGBS was plotted with ggplot2 in R. To obtain the ratio of methylated to unmethylated CpG sites near a SNP across the genome for both WGBS and WGS-GS data, we used bedtools to find the CpG site nearest to each SNP within 50 bp. The nearest-CpG track for both WGBS and WGS-GS was then overlapped with the methylomes of the same sample to obtain the methylation ratio. The density distribution of the methylation ratio for WGBS and WGS-GS was plotted with ggplot2 in R.
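The SNP filter and the GC content track described above can be sketched in Python. This is a minimal illustration under assumptions, not the pipeline code: the function names and the "WGS"/"WGBS" keys are hypothetical, the window step is not stated in the text and defaults here to non-overlapping windows, and GC content is computed as G plus C bases divided by window length without special handling of N bases.

```python
from typing import List, Tuple

MIN_DEPTH = 5                               # at least 5x coverage
MIN_QUAL = {"WGS": 200.0, "WGBS": 20.0}     # QUAL thresholds from the text


def passes_filter(depth: int, qual: float, data_type: str) -> bool:
    """Apply the depth and QUAL thresholds for the given data type."""
    return depth >= MIN_DEPTH and qual >= MIN_QUAL[data_type]


def gc_track(sequence: str, window: int = 100,
             step: int = 100) -> List[Tuple[int, int, float]]:
    """Return (start, end, gc_fraction) for each full window.

    `step` is an assumption; set it below `window` for overlapping
    sliding windows. A trailing partial window is dropped.
    """
    seq = sequence.upper()
    track = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        track.append((start, start + window, gc / window))
    return track


if __name__ == "__main__":
    toy_sequence = "GC" * 50 + "AT" * 50    # toy sequence, not hg19
    print(gc_track(toy_sequence))
    print(passes_filter(depth=8, qual=150.0, data_type="WGBS"))
```

The lower QUAL threshold for WGBS reflects the text's filter settings: a call that passes at QUAL 150 in WGBS data would be rejected under the stricter WGS threshold.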