CNV detection methods
In general, there are three main approaches to identifying CNVs from next-generation sequencing (NGS) data: 1) read depth, 2) paired-end, and 3) assembly [31]. In the read depth (RD) approach, a non-overlapping sliding window is typically used to count the short reads mapped to the genomic region that the window covers. These read count values are then used to identify CNV regions. As sequencing costs have fallen and sequencing technologies have improved, more and more high-coverage NGS data have become available; as a result, RD-based methods have recently become a major approach for identifying CNVs. The paired-end (PE) approach, which is applied to paired-end NGS data, identifies genomic aberrations based on the distance between paired reads. In paired-end sequencing data, reads from both ends of each genomic segment are available. The distance between a pair of reads is used as an indicator of a genomic aberration, including CNV: an aberration is detected when the distance differs significantly from the predetermined average insert size. This approach is mostly used to identify other types of structural variation (beyond CNVs), such as inversions and translocations. In the assembly approach, overlapping short reads are connected to assemble genomic regions (contigs), and CNV regions are detected by comparing the assembled contigs to the reference genome. In this method, short reads are not first aligned to the reference genome. In WES, the targeted regions are exons, which are very short and discontinuous across the genome. As a result, the PE and assembly approaches are not suitable for identifying CNVs from WES data. In addition, the high coverage of WES data makes the RD approach more practical. Therefore, all CNV detection tools for WES are based on the RD approach.
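The window-based read counting used by the RD approach can be sketched as follows. This is a minimal illustration, not the procedure of any specific tool: the function name `window_read_counts` is hypothetical, and assigning each read to the window that contains its start position is an assumption made here for simplicity.

```python
from collections import Counter


def window_read_counts(read_starts, window_size):
    """Count reads per non-overlapping window on one chromosome.

    read_starts: iterable of 0-based mapped start positions.
    Each read is assigned to the window containing its start position
    (a simplifying assumption; tools may handle spanning reads differently).
    Returns a list of counts, one per window, starting at position 0.
    """
    counts = Counter(pos // window_size for pos in read_starts)
    n_windows = max(counts) + 1 if counts else 0
    return [counts.get(i, 0) for i in range(n_windows)]


# Toy example: seven reads counted in 100 bp windows.
starts = [5, 40, 99, 100, 250, 260, 299]
counts = window_read_counts(starts, 100)
```

In practice the start positions would come from an aligned BAM or SAM file, and for WES the windows would be restricted to the targeted exons.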
In general, the RD approach consists of two major steps: 1) preprocessing and 2) segmentation. The input data are aligned short reads in BAM, SAM or Pileup format. In the preprocessing step, the biases and noise of WES data are eliminated or reduced; normalization and de-noising algorithms are the main components of this step. In the segmentation step, a statistical approach is used to merge regions with similar read counts into estimated CNV segments. The most commonly used statistical methods for segmentation are circular binary segmentation (CBS) and the hidden Markov model (HMM). In CBS, the algorithm recursively localizes breakpoints by changing genomic positions until each chromosome is divided into segments of equal copy number that differ significantly from the copy numbers of their adjacent genomic regions. In an HMM, the read count windows are classified sequentially along the chromosome according to whether they are likely to measure an amplification, a deletion, or a region in which no copy number change has occurred. Although other statistical methods have been introduced for detecting CNVs from WGS data, these two are the most common methods in current CNV detection tools for WES data.
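To make the HMM idea concrete, the sketch below labels each window as deletion, neutral or amplification with a three-state HMM decoded by the Viterbi algorithm. It is only an illustration under stated assumptions: the emission means, the shared standard deviation and the self-transition probability are hypothetical fixed values, whereas real tools estimate such parameters from the data.

```python
import math

STATES = ("deletion", "neutral", "amplification")
# Hypothetical emission means for log2 ratios (assumed, not fitted).
MEANS = {"deletion": -1.0, "neutral": 0.0, "amplification": 0.58}
SIGMA = 0.3   # assumed common standard deviation of the log2 ratios
STAY = 0.98   # assumed probability of remaining in the same state


def log_gauss(x, mu, sigma):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def log_trans(a, b):
    """Log transition probability from state a to state b."""
    if a == b:
        return math.log(STAY)
    return math.log((1 - STAY) / (len(STATES) - 1))


def viterbi(log2_ratios):
    """Return the most likely state label for each window."""
    if not log2_ratios:
        return []
    # Uniform initial state distribution.
    score = {
        s: math.log(1 / len(STATES)) + log_gauss(log2_ratios[0], MEANS[s], SIGMA)
        for s in STATES
    }
    back = []
    for x in log2_ratios[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            pointers[s] = prev
            new_score[s] = score[prev] + log_trans(prev, s) + log_gauss(x, MEANS[s], SIGMA)
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


ratios = [0.05, -0.02, 0.01, -0.95, -1.1, -1.0, 0.03, 0.0]
labels = viterbi(ratios)
```

Runs of identical labels in `labels` correspond to candidate segments. The high self-transition probability is what keeps a single noisy window from being called as its own segment.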
Challenges for detecting somatic CNVs in cancer
Despite improvements to sequencing technologies and CNV detection methods, identifying CNVs is still a challenging problem. The complexity of tumors and the technical problems of WES add further challenges to identifying somatic CNVs from WES data in cancer [31, 32]. In this section we briefly explain the challenges that somatic CNV identification faces in cancer when using WES data. We divide these challenges into three classes: challenges due to 1) sequencing data, 2) WES technical problems, and 3) tumor complexity.
Challenges due to sequencing data
The main assumption of RD-based CNV detection algorithms is that the read count and the copy number of a particular region are correlated. However, biases and noise distort the relationship between read count and copy number. These include GC bias, mappability bias, experimental noise, and technical (sequencing) noise. GC content varies significantly along the genome and has been found to influence read coverage on most sequencing platforms [33, 34]. In the alignment step, a large number of reads are mapped to multiple positions because of the short read length and the presence of repetitive regions in the reference genome [34, 35]. These alignment ambiguities can produce unavoidable biases and errors in RD-based CNV detection methods [33]. Furthermore, sample preparation, library preparation and the sequencing process introduce experimental and systematic noise that can hinder CNV detection [34, 36].
Challenges due to WES technical problems
The exome capture procedure in the library preparation process for WES introduces biases and noise that distort the relationship between read count and CNV. In WES library preparation, the hybridization process produces biases. In addition, the distribution of reads across the exonic regions is not even, which is another source of bias [37]. It is very common for the read count to be very low in some genomic regions. These low read counts affect the statistical analysis for calling CNVs and, as a result, produce noise in the CNV detection algorithms.
Challenges due to tumor complexity
Tumor complexity also distorts the relationship between read count and CNV and, as a result, produces noise. Tumor complexity includes tumor purity, tumor ploidy, and subclonal heterogeneity. Tumor samples are usually contaminated by normal cells, so the reads mapped to a particular region do not all originate from tumor cells. As a result, read count values do not fully reflect the copy number of the tumor cells, and the observed tumor-to-normal ratio is attenuated relative to its true value. This makes calling copy number segments difficult: the threshold for calling a CNV depends on tumor purity, which is usually unknown. A few tools are available to estimate tumor purity [38, 39]. Aneuploidy of the tumor genome is observed in almost all tumors [40], which makes it difficult to determine copy number values. The tumor-to-normal read count ratio is measured relative to the average ploidy, which is usually unknown for a tumor sample. Multiple clonal subpopulations of cells have also been observed in tumors [41]. Subclones are hard to identify because of their low percentage in a sample. This intra-tumor heterogeneity, or multiple clonality, distorts the CNV signal and complicates the calling of CNV segments.
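The effect of purity on the observed ratio can be illustrated with a simple mixture model. The symbols here are introduced only for illustration and are not taken from the tools discussed: p is the tumor purity, C_t is the tumor copy number of a region, normal cells are assumed to carry two copies, and the ratio is assumed to be normalized to a diploid baseline.

$$ R=\frac{p\,{C}_t+2\left(1-p\right)}{2} $$

For example, for a region with C_t = 4 and p = 0.5, R = 1.5 and log2 R ≈ 0.58, rather than the value of 1 expected for a pure tumor sample. For a single-copy deletion (C_t = 1) at the same purity, R = 0.75 and log2 R ≈ −0.42, rather than −1. A fixed log2 ratio threshold therefore becomes less sensitive as purity decreases.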
CNV detection tools
As of August 2016, we had identified fifteen sequence-based CNV detection tools for WES data (Additional file 1: Table S1). Several studies have already evaluated and compared the performance of CNV detection tools for WES data [31, 32, 42]. However, their focus was not on cancer. In this work, we restricted the analysis and comparison of CNV tools to those that have been used for, or are able to detect, cancer-specific (somatic) aberrations. Because sequencing technologies advance quickly, we also focused on widely used and more recent tools. Out of the available CNV detection tools for WES data, we chose those that (1) are able to detect somatic aberrations, (2) use the read depth (RD) method, and (3) were published in recent years or are commonly used. Six tools meet these criteria: (1) ADTEx [25], (2) CONTRA [43], (3) cn.MOPS [44], (4) ExomeCNV [45], (5) VarScan2 [46], and (6) CoNVEX [47]. ADTEx and CoNVEX were developed by the same group using a similar method; ADTEx is a modified version of CoNVEX. As a result, we only considered ADTEx. More recent tools, such as CANOES [48], ExomeDepth [49], and cnvCapSeq [50], are not used specifically for cancer; therefore we did not consider them in this study. The tools that we considered in this study and their general characteristics are listed in Table 1.
Table 1 Selected tools for the performance analysis of CNV detection tools using WES data
ADTEx [25] is specifically designed to infer copy number and genotypes using WES data from paired tumor/normal samples. ADTEx uses both read count ratios and B allele frequencies (BAF) to detect CNVs along with their genotypes. It addresses the problem of tumor complexity by employing BAF data, if these data are available. For normalization, ADTEx first calculates the average read count of exonic regions for both tumor and normal, and then computes the ratio of read counts for each exonic region. ADTEx also uses the Discrete Wavelet Transform as a preprocessing step to reduce the noise in the read count ratio data. It uses the HMM method for segmentation and CNV calling. Two HMMs are used in the detection algorithm: one detects CNVs in combination with the BAF signal to estimate the ploidy of the tumor and predict the absolute copy numbers; the other predicts the zygosity or genotype of each CNV segment. When the BAFs of tumor samples are available, the HMM is fitted for different base ploidy values. To determine the base ploidy, ADTEx selects the SNPs that overlap with each exonic region, segments the BAFs using the CBS algorithm, estimates the B allele count for different ploidy levels, and finally uses the distances between B allele counts to provide the best fit for the base ploidy.
CONTRA [43] is a CNV detection method for targeted resequencing data, including WES data. It is designed to detect CNVs for very small target regions, ranging from 100 to 200 bp. The main difference between CONTRA and the other methods is that it calculates and normalizes the read count and log ratio for each base (not for a window or exon). This allows for better GC normalization and log ratio calculation in low coverage regions. After calculating base-level log ratios, it estimates region-level log ratios by averaging the base-level log ratios over the targeted regions (exons in WES). Then, it normalizes the region-level log ratios for the library sizes of the case and control samples. The significance values of the normalized region-level log ratios are calculated by modeling the region-level log ratios as a normal distribution. To detect large CNVs spanning multiple targeted regions (exons), CONTRA performs CBS on the region-level log ratios. For a CNV segment to be called, at least half of the segment has to overlap with significant region-level CNVs. This method addresses the problems of very low coverage regions and sequencing biases (GC bias), which are due to the uneven distribution of reads in WES.
The main difference between cn.MOPS [44] and the other tools is that it can use several samples for each genomic region to obtain a better estimate of variation and true copy numbers. cn.MOPS uses a non-overlapping sliding window to compute read counts for genomic regions. To model the read counts, it employs a mixture of Poisson distributions across the samples. The model is used to estimate the copy number for each genomic region. cn.MOPS does not calculate ratios of case and control. Instead, it uses a metric that measures the distance between the observed data and the null hypothesis that all samples have a copy number of 2. If the copy number differs from 2 across the samples, the metric is higher. This metric is used for per-sample segmentation by CBS. At each genomic position, cn.MOPS models the read counts across samples, so it is not affected by read count variation along the chromosomes. By using a Bayesian approach, cn.MOPS can estimate noise and thereby reduce the false discovery rate (FDR).
ExomeCNV [45] is designed specifically for WES data from pairs of case-control samples, such as tumor-normal pairs. It counts the reads overlapping each exon and, using these read counts for tumor and normal, computes the ratio of read counts for each exonic region. The Hinkley transformation (ratio distribution) is used to infer a normal distribution for the read count ratios. After the tumor-to-normal ratios have been found for the exonic regions, CBS is used for segmentation. If the tumor purity is given in advance, ExomeCNV uses it to compute copy numbers. It can also detect loss of heterozygosity (LOH) if BAF data are given. To normalize the average read count per exon, ExomeCNV divides it by the overall exome average read count.
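The kind of normalization described above, in which per-exon read counts are divided by the sample-wide average before the tumor-to-normal ratio is taken, can be sketched as follows. This is an illustrative simplification and not the code of ExomeCNV or ADTEx; the function name is hypothetical, and returning `None` for exons with a zero count in either sample is an assumption made here.

```python
import math


def normalized_log2_ratios(tumor_counts, normal_counts):
    """Per-exon log2 ratio of mean-normalized tumor and normal read counts.

    Both inputs are lists of read counts over the same exons, in the same
    order. Exons with a zero count in either sample yield None, because the
    log ratio is undefined there (an assumption of this sketch).
    """
    t_mean = sum(tumor_counts) / len(tumor_counts)
    n_mean = sum(normal_counts) / len(normal_counts)
    ratios = []
    for t, n in zip(tumor_counts, normal_counts):
        if t == 0 or n == 0:
            ratios.append(None)
        else:
            ratios.append(math.log2((t / t_mean) / (n / n_mean)))
    return ratios


tumor = [200, 400, 200, 0]
normal = [100, 100, 100, 100]
log_ratios = normalized_log2_ratios(tumor, normal)
```

Dividing by the sample mean removes the difference in overall sequencing depth between the two samples, so that the ratio reflects relative rather than absolute read counts.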
VarScan2 [46] is also specifically designed for the detection of somatic CNVs in WES data from tumor–normal pairs. To compute the read counts of bases, the algorithm considers only high quality bases (phred base quality ≥20) for the tumor and normal samples individually. It does not use a sliding window or exons to generate read count data. Instead, it calculates the tumor-to-normal read count ratios of the high quality bases that fulfill the minimum coverage requirement. Then, in each chromosome, consecutive bases whose tumor-to-normal read count ratios do not change significantly, based on Fisher’s exact test, are binned together as a genomic region to generate read count data. For each genomic region, copy number alterations are detected and then normalized based on the amount of input data for each sample. A segmentation algorithm is not embedded in the VarScan2 tool; the CBS algorithm is recommended for segmenting the genomic regions.
Data sets
In this work, we used real and simulated WES data to evaluate CNV tools’ performances.
Real data
We used WES datasets of ten breast cancer tumor-normal pairs from The Cancer Genome Atlas (TCGA) to evaluate the performance of the CNV detection tools. The list of samples is given in Additional file 1: Table S2. The WES data were generated on the Illumina Genome Analyzer platform at the Washington University Genome Sequencing Center (WUGSC). The aligned BAM files of these 20 samples (10 tumor-normal pairs) were downloaded from The Cancer Genomics Hub (CGHub), https://cghub.ucsc.edu/index.html. We also used array-based CNV data from the same 10 tumor samples as a benchmark for evaluating the CNV detection tools. For the 10 tumors, we downloaded level 3 SNP-array data from the Affymetrix Genome-Wide SNP6 platform from the TCGA data portal website (https://portal.gdc.cancer.gov/projects/TCGA-BRCA).
Simulated data
To evaluate the performance of the tools, we have also used benchmark datasets generated by a CNV simulator, called VarSimLab [51]. VarSimLab is a simulation software tool that is highly optimized to make use of existing short read simulators. Reference genome in FASTA format and sequencing targets (exons in the case of WES) in BED format are inputs of the simulator. A list of CNV regions that are affected by amplifications or deletions is randomly generated according to the simulation parameters. The CNV simulator manipulates the reference genome file and the target file before generating short reads that exhibit CNVs. The output consists of: (i) a list file that contains the synthesized amplifications and deletions in txt format, (ii) short reads with no CNVs as control in FASTQ format, and (iii) short reads with synthesized CNV as case in FASTQ format.
We used VarSimLab to generate simulated short reads of length 100 bp for chromosome 1. We generated synthesized datasets with 3 M, 2 M, 1 M, 0.5 M, 0.1 M, 0.05 M, 0.01 M reads to simulate different coverage values (approximately from 0.2X to 60X in exonic regions). For each coverage value, we generated 10 datasets (70 datasets in total). These simulated data with known CNV regions were used to evaluate the performance of the CNV detection tools in terms of sensitivity and specificity for identifying CNV regions.
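The relationship between the number of simulated reads and the resulting coverage can be checked with a short calculation. The target size below is an assumption: it is not stated in the text, but a value of about 5 Mb of exonic sequence on chromosome 1 is what the quoted range of roughly 0.2X to 60X implies for 100 bp reads. The sketch also assumes that every read falls on target.

```python
def mean_coverage(n_reads, read_length, target_bp):
    """Mean depth of coverage if all reads fall within the target."""
    return n_reads * read_length / target_bp


READ_LENGTH = 100        # bp, as stated in the text
TARGET_BP = 5_000_000    # assumed exonic target size, inferred from the quoted range

read_counts = [3_000_000, 2_000_000, 1_000_000, 500_000, 100_000, 50_000, 10_000]
coverages = {n: mean_coverage(n, READ_LENGTH, TARGET_BP) for n in read_counts}
```

Under these assumptions the largest dataset corresponds to about 60X and the smallest to about 0.2X, matching the range given above.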
Comparison methods
To evaluate the performance of the tools in terms of sensitivity, false discovery rate (FDR) and specificity for detecting CNVs we compared their detected CNVs with the benchmark CNVs. For this comparison, we utilized two approaches: 1) gene-based comparison, and 2) segment-based comparison. Gene-based comparison analysis indicates the performance of the tools on calling CNVs only on exonic regions, which are the targets of the WES. However, segment-based analysis indicates the performance of the tools on overall calling CNV segments across the genome.
Gene-based comparison
For the gene-based comparison, we first annotated the detected CNV segments in the benchmark and samples for both real data and simulated data. We used the “cghMCR” R package from Bioconductor [52] to identify CNV genes using RefSeq gene identifications. The average of the CNV values of the overlapping CNV segments for each gene is used as the gene CNV value. A threshold of ±thr for log2 ratios was used for calling CNV genes, that is: amplification for log2 ratios > thr, deletion for log2 ratios < −thr, and no CNV for log2 ratios between −thr and thr.
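The gene-level calling rule can be written compactly as below. This is a sketch only: the function names and the example threshold value are hypothetical, and the gene CNV value is taken as the plain average of the overlapping segment values, as described in the text.

```python
def gene_cnv_value(segment_values):
    """Average the log2 ratios of the CNV segments overlapping a gene."""
    return sum(segment_values) / len(segment_values)


def call_cnv(log2_ratio, thr):
    """Classify a log2 ratio as amplification, deletion or no CNV."""
    if log2_ratio > thr:
        return "amplification"
    if log2_ratio < -thr:
        return "deletion"
    return "no_cnv"


# Example with a hypothetical threshold of 0.2.
value = gene_cnv_value([0.5, 0.3])
call = call_cnv(value, thr=0.2)
```

Values exactly equal to thr or −thr fall into the no CNV class under this rule, because the inequalities for amplification and deletion are strict.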
For each tool, we computed sensitivity, specificity and FDR separately for amplification and deletion. If we name the detected CNV value for a specific gene as CNVtest and the benchmark CNV value of the gene as CNVbench, then we can define True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN) for amplified and deleted genes as given in Table 2.
Table 2 Computing TP, FP, TN and FN for Gene-Based comparison of the performance of the tools
The sensitivities or true positive rates (TPRs), specificities (SPCs) and FDRs are calculated using the following equations for both amplified and deleted genes.
$$ TPR=\frac{TP}{TP+FN}, \tag{1} $$
$$ FDR=\frac{FP}{TP+FP}, \tag{2} $$
and
$$ SPC=\frac{TN}{FP+TN}. \tag{3} $$
For each tool, we calculated the TPR, SPC, and FDR for every dataset and used their average values.
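A minimal sketch of how Eqs. 1 to 3 could be evaluated for one event type is given below. The confusion-matrix definitions used here are an assumption, since the contents of Table 2 are not reproduced in this text: a gene counts as positive for an event when its call equals that event, and as negative otherwise. The function names are hypothetical, and genes missing from the test calls are treated as having no CNV.

```python
def confusion(test_calls, bench_calls, event):
    """Count TP, FP, TN, FN for one event type ('amplification' or 'deletion').

    test_calls and bench_calls map gene name -> call string.
    """
    tp = fp = tn = fn = 0
    for gene, bench in bench_calls.items():
        test = test_calls.get(gene, "no_cnv")
        if test == event and bench == event:
            tp += 1
        elif test == event:
            fp += 1
        elif bench == event:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return TPR, FDR and SPC; None where the denominator is zero."""
    def ratio(num, den):
        return num / den if den else None
    return {
        "TPR": ratio(tp, tp + fn),
        "FDR": ratio(fp, tp + fp),
        "SPC": ratio(tn, fp + tn),
    }


bench = {"A": "amplification", "B": "amplification", "C": "no_cnv",
         "D": "deletion", "E": "no_cnv"}
test = {"A": "amplification", "B": "no_cnv", "C": "amplification",
        "D": "deletion", "E": "no_cnv"}
amp_counts = confusion(test, bench, "amplification")
amp_rates = rates(*amp_counts)
```

The same two functions can be called with `"deletion"` to obtain the deletion-specific rates, and the per-dataset results can then be averaged per tool.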
Segment-based comparison
For the segment-based comparison, we focused on comparing the CNV segments between detected CNVs and benchmark CNVs. As in the gene-based comparison, we used a threshold (thr) to call amplified, deleted and no-CNV segments. Comparing CNV regions between detected CNVs and their corresponding benchmark CNVs is more complicated than comparing CNV genes. Detected CNV segments, unlike CNV genes, have different sizes and different start and end positions from those of the benchmark CNV segments. We used the “GenomicRanges” R package from Bioconductor [52] to obtain the overlapping regions between detected CNVs and benchmark CNVs. If an amplified/deleted segment of a sample (CNV > thr / CNV < −thr) had an overlap of 80% or more with a benchmark amplified/deleted segment, it was considered a TP. If no overlap of 80% or more could be found between a detected CNV region and any benchmark CNV, the detected CNV segment was considered an FP. An amplified/deleted segment in the benchmark that did not have an overlap of 80% or more with any detected amplified/deleted region was called an FN. Since regions with no CNVs cover very large sections of a genome, we did not calculate TN regions. Therefore, for the segment-based comparison we calculated TPRs and FDRs using Eqs. 1 and 2. If we name a CNV segment of a sample TestSeg and a CNV segment of the benchmark BenchSeg, we can calculate TPs, FPs and FNs as shown in Table 3.
Table 3 Computing TP, FP and FN for Segment-Based comparison
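The 80% overlap rule can be illustrated with the sketch below. The study used the GenomicRanges R package; this Python version only illustrates the logic and makes several assumptions that are labeled here because the text does not specify them: segments are half-open (start, end) intervals on the same chromosome and of the same event type, the overlap fraction is measured relative to the length of the segment being evaluated, and only the best single overlapping segment is considered.

```python
def overlap_bp(a, b):
    """Number of bases shared by two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def covered_fraction(seg, others):
    """Largest fraction of seg covered by any single segment in others."""
    length = seg[1] - seg[0]
    return max((overlap_bp(seg, o) / length for o in others), default=0.0)


def segment_counts(test_segs, bench_segs, min_frac=0.8):
    """Return (TP, FP, FN) for one event type on one chromosome."""
    tp = sum(1 for t in test_segs if covered_fraction(t, bench_segs) >= min_frac)
    fp = len(test_segs) - tp
    fn = sum(1 for b in bench_segs if covered_fraction(b, test_segs) < min_frac)
    return tp, fp, fn


bench_segs = [(100, 200), (500, 600)]
test_segs = [(110, 190), (300, 350)]
tp, fp, fn = segment_counts(test_segs, bench_segs)
```

With these counts, the TPR and FDR follow from Eqs. 1 and 2. Other overlap definitions, such as a reciprocal 80% overlap, would give different counts for the same segments.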