Background

Somatic mutations play a key role in cancer development, progression and chemotherapy resistance. Therefore, several studies have been profiling somatic mutations in cancer samples by applying next generation sequencing technologies, allowing the discovery of drug targets, prognostic DNA markers and protocols of targeted therapies. Whole Exome Sequencing (WES) has become a popular approach because it is cost effective and it detects approximately 25,000 single nucleotide variants (SNVs) in the coding region of human DNA. However, the detection of somatic mutations in normal-cancer paired samples presents unique challenges: 1) detecting low allelic frequency mutations due to tumor heterogeneity, subclonality and copy number variation events; 2) differentiating true mutations from alignment artifacts and sequencing errors; 3) classifying mutations as somatic or germ-line polymorphisms; and 4) analyzing tumor samples contaminated by normal cells and vice-versa [1]. The understanding of the mutational landscape of cancer genomes requires the development of methods that detect somatic mutations and deal with these challenges.

Several studies have compared the performance of different pipelines, softwares and parametrizations [27]. In general, the available tools classify the somatic mutations by either independently or simultaneously analyzing the tumor and normal samples; but, since they have different prior assumptions and error modeling approaches, many research groups have reported low concordance among methods [4, 8]. The available tools either detect too many false positives in order to get all true positives or lose too many true positives in order to reduce the number of false positives [9]. In the first case, the researcher spends much time and resource validating the set of candidate variants to select the true ones. In the second case, important mutations that explain the biological characteristics of the cancer cells, may be missed. This evidence, along with the variability in the performance of each software according to studies and tumor type, indicates that the research community faces a big challenge choosing the right pipeline among all available options.

In this study, we aimed to develop a pipeline that detects a wide and high confident profile of single nucleotide variants in sequencing data of cancer samples. Our pipeline brings together the benefits of two standard tools: Genome Analysis Toolkit (GATK) and MuTect. GATK independently calls variants in the normal and tumor samples, while MuTect performs the analysis simultaneously. We created the GATK-LODN method, which is part of the MuTect algorithm, that is applied downstream to the GATK analysis in order to ensure the somatic classification of the GATK results and reduce its false positive calls. As proof of principle, we applied our pipeline to hematological (Acute Myeloid and Acute Lymphoblastic Leukemias) and solid (Gastrointestinal Stromal Tumor and Lung Adenocarcinoma) tumors. We also tested our pipeline on simulated data and technical replicate samples to evaluate its sensitivity and specificity. Our results show that the pipeline performed well and we believe that it can be helpful in discovery studies aimed to profile the somatic mutational landscape of cancer genomes.

Methods

Sequencing data

Primary samples were collected from Acute Myeloid Leukemia (n = 37) and Acute Lymphoblastic Leukemia patients (n = 41) after obtaining informed consent as approved by the Institutional Ethical Committee (protocol number 253/2013/O/Tess) of Azienda Ospedaliero-Universitaria, Policlinico Sant’Orsola-Malpighi (Bologna, Italy) in accordance with the Declaration of Helsinki. Leukocytes were enriched from bone marrow and peripheral blood samples by separation on Ficoll density gradient. Saliva samples, used as normal matching, were collected with the Oragene Discover kit (DNA Genotek). The DNA was extracted from leukocytes by column purification (AllPrep DNA/RNA/Protein Mini Kit and QIAcube, Qiagen) and from saliva by paramagnetic particles (Maxwell® 16 LEV DNA Blood Purification Kit and Maxwell® MDx Instrument), according to manufacturer’s protocol. The exonic regions were captured by TrueSeq™ Exome Enrichment Kit and Nextera Rapid Capture Expanded Exome, comprising a targeted region of 62 Mb, and 201,121 exonic regions. Illumina HiSeq2000 sequencing produced an average of 55.2 and 63 million 100 bp paired-end reads per sample in AML and ALL cohorts, respectively. The AML and ALL data sets are available upon request to the Next Generation Sequencing for Targeted Personalized Therapy of Leukemia consortium. We also selected two public datasets of Illumina HiSeq 2000 whole exome sequencing from the NCBI Sequence Read Archive: 1) seven Gastrointestinal Stromal Tumors (GIST) samples, and their matching peripheral blood samples, with an average of 35.5 million 100 bp paired-end reads per sample [SRA: SRR1299130-141 and SRR1299144-147] [10]; and 2) two Lung Adenocarcinoma samples, and their normal counterparts, with an average of 56.5 million 100 pb paired-end reads per sample [SRA: ERR160124, ERR160136, ERR166338, and ERR166339] [11]. After the quality control check, the average of final coverages were: 72X (±30X), 119X (±28X), 76X (±7X), 133X (±64X); for AML, ALL, GIST, and Lung Adenocarcinoma, respectively (Additional file 1 provides, for each tumor type, the samples IDs and coverage information).

Pipeline for somatic variant discovery

Initially, the sequencing reads were submitted to a quality control check by using the scripts fastq_quality_filter.pl and fastq_quality_trimmer.pl from FASTX-Toolkit [12]. The phred value 20 was chosen as the minimum threshold for base quality. The reads having more than 80 % of low quality bases were removed or had their 3′ extremity bases trimmed when the minimum threshold was not reached. After, the reads were aligned to the human reference genome hg19/GRCh37 using BWA-MEM [13] with default parameters and Picard [14] was applied for post-alignment procedures as sorting, indexing, and marking duplicates. The alignments were submitted to local realignment around INDELs and base quality score recalibration (BQSR) by using the Genome Analysis Toolkit (GATK) version 3.0 [15].

MuTect [16] and GATK (Haplotype Caller) were used for the single nucleotide variant calling. GATK variants were filtered with the Variant Quality Score Recalibration tool following the best practices on the GATK website. GATK performs the variant calling and filtration in the normal and tumor samples independently, thus the subtraction between the tumor and the normal variants resulted in our first set of candidate somatic variants.

To ensure the somatic classification of the SNVs called by GATK, we adapted the MuTect algorithm and applied its LODN classifier after the GATK variant calling and filtering. The LODN is a bayesian classifier that compares the likelihood of two models: (1) the mutation does not exist in the normal sample and all non-reference bases are explained by sequencing noise, and (2) the mutation truly exists in the normal sample as a germ-line heterozygous variant. The ratio of these two likelihoods is called LOD (Log Odds) score and when it exceeds a decision threshold, the mutation can be classified as somatic. For this filtering, we considered only sites that had total read depth greater or equal than 8 in the normal sample and greater or equal than 14 in the tumor sample. Our final candidate list consisted in the union of MuTect and GATK-LODN results.

The variants were annotated by ANNOVAR [17], with the Ensembl Gene annotation database for human genome build 37 (http://www.ensembl.org/), and searched for matches in the dbSNP138 and 1000 Genomes data. We selected exonic single nucleotide variants (SNVs) that were non-synonymous and gain or loss of stop codon. Variants present in dbSNP138 and 1000 Genomes with minor allele frequency (MAF) greater than 0.05 were removed. Figure 1 shows the summary of the pipeline steps. The scripts for running the main pipeline steps are availabe in the link: https://bitbucket.org/BBDA-UNIBO/wes-pipeline.

Fig. 1
figure 1

Pipeline of SNV detection in sequencing data of cancer samples. Summary of steps and their respective tools in the detection of SNVs in paired normal-cancer sequencing data

A subset of variants from MuTect, GATK and GATK-LODN calls were selected for validation. Variants with allelic frequency higher than 0.2 were validated by Sanger Sequencing and those with allelic frequency lower than 0.2 were validated by using the Illumina TruSight Myeloid Sequencing Panel and Illumina MiSeq sequencing. Data were analyzed by the VariantStudio software (Illumina), according to manufacturer’s instruction.

Pipeline testing

As MuTect eventually miscalled variants already profiled by Sanger sequencing at the moment of diagnosis, we tested adapting the MuTect algorithm by lowering its two main parameters and thresholds – ΘT > = 6.5 and ΘN|dbSNP site > = 5.5 – that determine the mutation detection and classification as somatic or germ-line. We calculated the ΘT and ΘN values for each variant in the GATK raw output and set the thresholds to the minimum values that would permit the correct classification of 10 variants previously identified by Sanger sequencing.

We simulated datasets to evaluate the specificity and sensitivity of the three variant calling methods: MuTect, GATK and GATK-LODN. The specificity was evaluated by splitting the sequencing data of the same sample in two, applying the three variant calling methods, and counting the number of total SNVs called. One saliva sample of our AML cohort (80X) had its reads randomized (reads sorted by query name) and it was split in two by using the bamutils tool of NGSUtils package [18]. The resultant alignment files were applied to each variant calling method. The sensitivity was calculated by creating artificial tumor samples, applying the variant calling methods, and counting the number of true positives called. We adapted the mutate_sample.py script from the Shimmer package [19] to create mutations in a saliva sample alignment. Three artificial tumors were created with 22, 25 and 25 SNVs, which had variant allelic fractions range of 0.02 to 0.25, 0.5 to 0.86, and 0.97 to 1.0, respectively (Table 1). For each artificial tumor sample, we created subsets by randomly excluding reads and simulated sequencing coverages in the range of 5X to 80X, with intervals of 5X. The creation of the subsets was performed by the DownsampleBam tool of Picard. We then evaluated the performance of each variant calling method at different coverage levels.

Table 1 Artificial tumor samples. Coordinate list of the single nucleotide variants inserted in the artificial tumor samples and their variant allelic frequencies

Results

We built a pipeline for discovery of single nucleotide variants (SNVs) in whole exome sequencing data and applied it to Acute Myeloid Leukemia (AML), Acute Lymphoid Leukemia (ALL), Gastrointestinal Stromal Tumor (GIST), and Lung Adenocarcinoma samples. First, we compared the results of the three variant calling procedures: MuTect, GATK, and GATK-LODN. GATK detected 3 to 20 times more SNVs than MuTect (Fig. 2a) and the results for the Lung Adenocarcinoma dataset presented the highest concordance (30 %) between the two methods. GATK-LODN strongly reduced the number of SNVs in GATK results for the hematological tumors (Fig. 2b). For the solid tumors, approximately 10 % of GATK specific SNVs remained after applying GATK-LODN, and, for the GIST dataset, it detected about three times more variants than MuTect.

Fig. 2
figure 2

The GATK-LODN method reduces the number of GATK false positive calls. Comparison of the number of SNVs between GATK and MuTect before (a) and after (b) applying the GATK-LODN method for each cancer whole exome sequencing dataset. AML: Acute Myeloid Leukemia, ALL: Acute Lymphoblastic Leukemia, GIST: Gastrointestinal Stromal Tumor, LA: Lung Adenocarcinoma

The MuTect algorithm has two main parameters: ΘT and ΘN. We calculated these values for a set of variants candidates (AML dataset) from GATK results and tested if we could reduce the number of false negatives by lowering these thresholds. We set the two parameters for ΘT > = 4.5 and ΘN|dbSNP site > = 3 and it permitted the detection of 10 variants previously profiled by Sanger sequencing, but not detected by the original MuTect analysis. However, the number of final candidates increased about 1.3 to 10 times in comparison with the original MuTect output (Table 2).

Table 2 Relaxing MuTect parameters increases the number of false positive calls. Number of variants found by MuTect, before and after relaxing the ΘT and ΘN parameters for six Acute Myeloid Leukemia (AML) normal-cancer sample pairs

We selected a set of candidate variants from the AML dataset and performed the validation experiment of each method in two rounds. In the first, we tested just the tumor samples, in order to evaluate the performance of each method in detecting the mutations. In the second round, we tested both tumor and normal samples, in order to evaluate the performance of each method in classifying mutations as somatic events. We observed that 18 out of 48 and 5 out of 18 GATK variants were correctly detected and classified, respectively, while MuTect presented high performance in both rounds (6 out of 7 and 2 out of 3, respectively). The GATK-LODN presented better validation rates than GATK for both mutation detection (18 out of 48 to 6 out of 9) and classification (5 out of 14 to 3 out of 4) (Table 3).

Table 3 The GATK-LODN method increases the GATK performance for both mutation detection and classification. The Sanger sequencing validation was performed in two rounds: in the first round we tested whether the methods correctly detected the mutation and in the second one we assessed whether the methods correctly classified the mutations as somatic events. The variant subsets tested (AML datatset) presented variants method specific and variants detected by one or more methods

Simulated data permitted the evaluation of sensitivity and specificity of the three variant calling methods. We measured the specificity by splitting a saliva sample alignment (80X) in two, applying to the pipeline and counting the number of called SNVs. Mutect, GATK, and GATK-LODN resulted in 8, 76 and 35 false positives, respectively. Then, we applied technical replicates of the same saliva sample to the pipeline and it resulted in 7, 84 and 33 false positives, respectively. We measured the sensitivity by simulating three artificial tumors with different Variant Allelic Frequency (VAF) ranges: one with high-frequency variants (n = 25, VAF: 0.97 to 1.0), one with intermediate-frequency variants (n = 25, VAF: 0.5 to 0.86), and another with low-frequency variants (n = 22, VAF: 0.02 to 0.25). MuTect presented a Positive Predictive Value (PPV) of 19/22 for low VAF mutations and its false negatives were composed by: one variant with VAF = 0.02, and two variants that had either VAF < 0.1 and total read depth smaller than 24 (Table 4). GATK presented the smallest performance for somatic variants, since it detected 2206 candidates out of 22 or 25 true positive variants. GATK-LODN presented a PPV of 17/22 for the low allelic frequency variants, but it missed variants with VAF < 0.095 (Table 4). MuTect detected all intermediate and high allelic frequency variants, while GATK-LODN presented PPVs of 23/30 and 23/31, respectively (Table 4).

Table 4 The GATK-LODN method presented good performance in artificial tumor samples. Performance of MuTect and GATK-LODN for artificial tumor samples that had variants with diverse allelic frequencies

For each artificial tumor, we simulated different sequencing coverages and evaluated the number of false negatives and true positives detected. We observed that, at different coverage levels, GATK-LODN and MuTect presented almost identical performance for the artificial tumors with high and intermediate variant frequency SNVs, except in the number of false negatives detected by GATK-LODN in the coverage interval of 5 to 20X. GATK-LODN presentedincreased number of detected true positives than MuTect in the coverage interval of 50 to 55X for high and intermediate-frequency variants, and in the coverage 20X for low-frequency variants (Fig. 3).

Fig. 3
figure 3

Number of False Negatives and True positives at different coverage levels. Three artificial tumors were created with 22, 25 and 25 SNVs, which had variant allelic fractions range of 0.02 to 0.25, 0.5 to 0.86, and 0.97 to 1.0, respectively. We counted the number of False Negatives (FN) and True positives (TP) for different levels of simulated sequencing coverage

Discussion

Our data show that the combination of standard tools - Genome Analysis Toolkit (GATK) and MuTect – improves the range of detected single nucleotide variants (SNVs) in whole exome sequencing data of cancer samples. We also developed the GATK-LODN method, which reduced the number of GATK false positive calls. Our study has the advantage of actually combining two different algorithms rather than proposing ways of unifying results of different tools [9, 20]. As one method originally presented high amounts of false positive calls (type I error) and the other high amounts of false negative calls (type II error), the GATK-LODN is an option of amplifying the range of detected SNVs without severely compromising sensitivity and specificity.

The GATK uses a Bayesian model to estimate the likelihood of a genotype given the observed sequence reads that cover the locus. It independently calls genotypes in tumor and normal samples, being the somatic mutations classified as those only present in the tumor sample. However, GATK detects many false positives likely due to germ-line variants with low sequencing coverage or low allelic frequency, that are not called in the normal samples. MuTect jointly analyzes tumor and normal samples, presenting high sensitivity, specificity and validation rates. Each method detects variants that the other does not detect, and a previous study demonstrated that the SNVs found only by GATK had relatively high validation rates [4]. One option would be taking into account just the results obtained from one tool, but it risks the selection of errors for which the algorithm is vulnerable [21]. Another option would be taking the intersection of multiple variant callers, but it will result in high false negative rates, since each tool uniquely identifies true variants [4]. We discarded the option of relaxing the MuTect parameters, since we observed that it included the detection of variants previously miscalled, but with the cost of including many false positives. Our study demonstrates the advantage of merging the results of MuTect and GATK-LODN, since GATK-LODN reduces the number of GATK false positives and detect variants not detected by MuTect. The GATK-LODN increased the performance of GATK in the sequencing validation experiments and in the simulated artificial tumor samples. We observed that the GATK-LODN also outperformed MuTect in some simulated sequencing coverages. As sequencing datasets usually present large variability in coverage and quality, the different error modeling approaches and prior assumptions associated to the two methods should permit good performances in a wide scenario.

We performed the validation experiments just for variants from the hematological tumors (available in our laboratories), thus the validation rate might change for solid tumors. The results show that GATK-LODN filtered more variants in the hematological tumors than in the solid tumors and we hypothesized that the normal samples from hematological tumors may be more prone to contamination by cancer cells. Although GATK-LODN provided a small number of variants in the hematological datasets, even a single variant can give insights into the mechanisms of malignant transformation and help design personalized therapeutic approaches [22, 23]. We observed that the Lung Adenocarcinoma presented the biggest concordance between methods, maybe because patients with this type of cancer usually presents high mutation frequencies and harbors more somatic mutations compared with other cancer types [2427]. The results also show that different methods may present bias to certain nucleotide substitution mutations, but more studies involving larger groups of tumors are needed.

The GATK-LODN is suitable for application together with other post-calling filtering features as: strand bias, nearby polymorphisms and technology specific sequencing errors removal [2830]. For instance, Carson et al. [7] suggested new thresholds for genotype and variant filters to be used in conjunction with the GATK pipeline analysis, that could increase the GATK-LODN performance in population-based studies. Altogether, the GATK-LODN allows enough flexibility to deal with different study designs and requirements about how stringent the analysis must be.

Here, we presented a tested pipeline that combines standard tools, aiming to detect a wide range of somatic single nucleotide variants with high specificity and sensitivity. We developed the GATK-LODN method, which can be helpful in large-cohort discovery studies aimed to profile the somatic mutational landscape from whole exome sequencing data of cancer samples.

Conclusion

Next generation sequencing analysis has drastically improved the biological knowledge of human cancers. Several tools and strategies are available to detect single nucleotide variants in normal-cancer paired samples, but many research groups report low concordance among them. In this study, we proposed a pipeline that applies two standard tools (MuTect and GATK) and one adapted method (GATK-LODN) that increased the performance of its original algorithm. The GATK-LODN method improved the overall performance by reducing the number of false positive calls and permitted the detection of variants not detected by MuTect. We believe that the proposed pipeline will help in the understanding of cancer biology through the discovery of somatic single nucleotide variants in cancer sequencing data.