Introduction

Sanger sequencing has dominated the genomic research for the past two decades and achieved a number of significant accomplishments including the completion of human genome sequence, which made the identification of single gene disorders and the detection of targeted somatic mutation for clinical molecular diagnostics possible [1, 2]. Despite Sanger sequencing's accomplishments, researchers are demanding for faster and more economical sequencing, which has led to the emergence of “next-generation” sequencing technologies (NGS). NGS’s ability to produce an enormous volume of data at a low price [3, 4] has allowed researchers to characterize the molecular landscape of diverse cancer types and has led to dramatic advances in cancer genomic studies.

The application of NGS, mainly through whole-genome (WGS) and whole-exome technologies (WES), has produced an explosion in the context and complexity of cancer genomic alterations, including point mutations, small insertions or deletions, copy number alternations and structural variations. By comparing these alterations to matched normal samples, researchers have been able to distinguish two categories of variants: somatic and germ line. The Whole transcriptome approach (RNA-Seq) can not only quantify gene expression profiles, but also detect alternative splicing, RNA editing and fusion transcripts. In addition, epigenetic alterations, DNA methylation change and histone modifications can be studied using other sequencing approaches including Bisulfite-Seq and ChIP-seq. The combination of these NGS technologies provides a high-resolution and global view of the cancer genome. Using powerful bioinformatics tools, researchers aim to decipher the huge amount of data to improve our understanding of cancer biology and to develop personalized treatment strategy. Figure 1 shows the workflow of integrating omics data in cancer research and clinical application.

Figure 1
figure 1

The workflow of integrating omics data in cancer research and clinical application. NGS technologies detect the genomic, transcriptomic and epigenomic alternations including mutations, copy number variations, structural variants, differentially expressed genes, fusion transcripts, DNA methylation change, etc. Various kinds of bioinformatics tools are used to analyze, integrate, and interpret the data to improve our understanding of cancer biology and develop personalized treatment strategy.

Cancer research

In the last several years, many NGS-based studies have been carried out to provide a comprehensive molecular characterization of cancers, to identify novel genetic alterations contributing to oncogenesis, cancer progression and metastasis, and to study tumor complexity, heterogeneity and evolution. These efforts have yielded significant achievements for breast cancer [512], ovarian cancer [13], colorectal cancer [14, 15], lung cancer [16], liver cancer [17], kidney cancer [18], head and neck cancer [19], melanoma [20], acute myeloid leukemia (AML) [21, 22], etc. Table 1 summarizes the recent advances in cancer genomics research applying NGS technologies.

Table 1 Recent NGS-based studies in cancer

Discovery of new cancer-related genes

Cancer is primarily caused by the accumulation of genetic alterations, which may be inherited in the germ line or acquired somatically during a cell’s life cycle. The effects of these alterations in oncogenes, tumor suppressor genes or DNA repair genes, allows cells to escape growth and regulatory control mechanisms, leading to the development of a tumor [23]. The progeny of the cancer cell may also undergo further mutations, resulting in clonal expansion [24]. As clonal expansion continues, clones eventually become invasive to its surrounding tissue and metastasize to distant areas from the primary tumor [25].

The sequencing of cancer genomes has revealed a number of novel cancer-related genes, especially in breast cancer. Recently, six papers reported their findings on large breast cancer dataset: TCGA performed exome sequencing on 510 samples from 507 patients [5], Banerji et al. conducted exome sequencing on 103 samples and whole genome sequencing on 17 samples, Ellis et al. did exome sequencing on 31 samples and whole genome sequencing on 46 samples [7], Stephens et al. applied exome sequencing on 100 samples, Shah et al. performed whole genome/exome and RNA sequencing on 65 and 80 samples of triple-negative breast cancers [11], and Nik-Zainal et al. performed whole genome sequencing on 21 tumor/normal pairs [12]. Besides confirming recurrent somatic mutations in TP53, GATA3 and PIK3CA, these studies discovered novel cancer-related mutations. Although novel mutations occur at low frequency (less than 10%), mutations of specific genes are enriched in the subtype of breast cancers and could be grouped into cancer-related pathways. For example, mutations of MAP3K1 frequently occur in luminal A subtype [5, 7]. Pathways involving p53, chromatin remodeling and ERBB signaling are overrepresented in mutated genes [11]. Furthermore, some mutations indicate therapeutic opportunities such as the mutant GATA3, which might be a positive predictive marker for aromatase inhibitor response [7].

Genomic sequencing has also helped characterize the mutation profile of colorectal cancer. For example, exome sequencing performed on 72 tumor-normal pairs identified 36,303 protein-altering somatic mutations. Further analysis for significantly mutated genes led to 23 candidates that included expected cancer genes such as KRAS, TP53 and PIK3CA and novel genes such as ATM, which regulates the cell cycle checkpoint. RNA sequencing identified recurrent R-spondin fusions, which might potentiate Wnt signaling and induce tumorigenesis [15]. Another example includes exome sequencing performed on 224 tumor and normal pairs. This study identified 15 highly mutated genes in the hypermutated cancers and 17 in the non-hypermutated cancers. Among the non-hypermutated cancers, novel frequent mutations in SOX9, ARID1A, ATM and FAM123B were detected besides the known APC, TP53 and KRAS mutations. The analysis of the mutations and functional roles of SOX9, ARID1A, ATM and FAM123B suggested they are highly potential colorectal cancer-related genes. Non-hypermutated colon and rectum cancers were found to have similar patterns in genomic alternation. Whole genome sequencing of 97 tumors with matched normal samples identified the recurrent NAV2-TCF7L1 fusion [14].

Tumor heterogeneity and evolution

What makes cancer a difficult disease to conquer has much to do with the evolution of cancer that results from the selection and genetic instability occurring in each clone, leading to heterogeneity in tumors [26]. This idea was first proposed by Peter Nowell in 1976 as the clonal evolution model of cancer, which attempted to explain the increase in tumor aggressiveness over a period of time. Further work by other researchers in the 1980s supported this theory with studies of metastatic subclones from a mouse sarcoma cell line [26].

The wide application of NGS has revealed substantial insights into tumor heterogeneity and tumor evolution. Variations between tumors are referred to as intertumor heterogeneity, while variations within a single tumor are intratumor heterogeneity. Intertumor heterogeneity is recognized by different morphological phenotype, expression profiles and mutation and copy number variation patterns, categorizing tumors into different subtypes [2731]. The mRNA-expression subtype was found to be associated with somatic mutation landscapes in the recent TCGA and Eillis et al.’s studies. [5, 7]. As a huge amount of somatic mutations generated by NGS, the picture emerges like that individual tumor is unique, each containing distinct mutation patterns. For instance, Stephens et al. found that there were 73 different combination possibilities of mutated cancer genes among the 100 breast cancers [9].

Intratumor heterogeneity can be recognized as non-identical cellular clones or subclones within a single tumor, indicating different histology, gene expression, and metastatic and proliferative potential. The ability to generate high-resolution data makes NGS a particularly useful tool for studying intratumor heterogeneity. A recent NGS-based study on renal cell carcinoma from four patients has successfully illuminated intratumor heterogeneity [18]. For patient 1, the pre-treatment samples of the primary tumor and chest-wall metastasis went through exon-capture multi-region sequencing on DNA. Of the 128 validated mutations found in 9 regions of the primary tumor, 40 were ubiquitous, 59 were shared by some regions, and 29 were unique to specific regions, showing that genetic heterogeneity exists within a tumor and an “ongoing regional clonal evolution” [18]. Most importantly, the study showed that a single biopsy of a tumor only reveals a small part of a tumor’s mutational landscape; from a single biopsy, about 55% of all mutations were detected in this tumor and 34% were shared by most regions of the tumor.

The ongoing and parallel evolution of cancer cells may establish and maintain intratumor heterogeneity. For example, phylogenetic relationships of the tumor regions in patient 1 and 2 by the renal cell carcinoma study revealed a branching rather than linear evolution of the tumor [18]. Studies have also shown branching structures of evolution in breast cancer [26]. According to the “Trunk-Branch Model of Tumor Growth” [26], there are somatic events that promote tumor growth, which represents the trunk of the tree in the early stage of tumor development. These somatic aberrations would most likely be ubiquitous at this stage. Over time, other somatic events, known as drivers, cause tumor heterogeneity to occur, which causes branching to take place in tumors as well as in metastatic sites. Later, these branches will evolve and become more isolated, resulting in a ‘Bottleneck Effect’ that can result in chromosomal instability, allowing further expansion of tumor heterogeneity [26]. This leads to the tumor’s ability to adapt and survive in changing environments, which affects the success of drug treatment [18]. Therefore, it is important to examine tumor clonal structure and identify common mutations located in the trunk of the phylogenetic tree, which may help understand target therapy resistance and discover more robust therapeutic approaches.

Clinical application

Besides allowing researchers to understand mutations in cancer, NGS has already been applied to the clinic in many areas including prenatal diagnostics, pathogen detection, genetic mutations, and more [32]. Although genetic mutations have been identified with Sanger sequencing, PCR, and microarrays in clinical application, these three have limitations that don’t apply to NGS. For example, although microarrays can detect single nucleotide variants (SNVs), they have trouble identifying larger DNA aberrations, e.g., large indels and structural rearrangements, which are common in cancer. In contrast, whole exome and whole-genome sequencing can provide the clinician a comprehensive view of the DNA aberrations, genetic recombination, and other mutations [28, 32]. Therefore, NGS platforms serve as a good diagnostic and prognostic tool and help clinicians identify specific characteristics in each patient, paving the road towards personalized medicine.

NGS has already been applied in the clinic for cancer diagnosis and prognosis. For example, whole genome sequencing identified a novel insertional fusion that created a classic bcr3 PML-RARA fusion gene for a patient with acute myeloid leukemia and the findings altered the treatment plan for the patient [33]. By sequencing the tumor genome of a patient, clinicians are able to design patient-specific probes that uses DNA in the patient’s blood serum to monitor the progress of a patient’s treatment and detect for any signs of relapse [2731]. The discovery of more biomarkers and the development of target-therapies will be essential in helping a clinician choose the best personalized treatment for his or her patients.

There has also been a dramatic increase in the number of clinical trials using NGS technologies since 2010 (Table 2). Ranging from WGS and WES to RNA-seq and targeted sequencing, clinical trials are using NGS to find genetic alterations that are the drivers of certain diseases in patients and apply that knowledge into the practice of clinical medicine. The information gained from these studies may help with drug development and explain the resistance of certain treatments.

Table 2 Active cancer studies using NGS as the primary outcome measure

Methods and resources

Pipeline and tools for NGS data analysis

To analyze and interpret the increasing amount of sequencing data, a number of statistical methods and bioinformatics tools have been developed. For WGS and WES, the analysis generally includes read alignment, variant detection (point mutation, small indels, copy number variation and structural rearrangement) and variant functional prediction (Table 3). Reads are mapped back to the human reference genomes using MAQ [34], BWA [35, 36], Bowtie2 [37], BFAST [38], SOAP2 [39], Novoalign/NovoalignCS, SSAHA2 [40], SHRiMP [41], etc. These methods differ in their computational efficiency, sensitivity and ability to accurately map noisy reads, to deal with long or short reads and pair-end reads. Having aligned the reads to the genome, mutation calling identifies the sites in which at least one of the bases differs from a reference sequence by GATK [42], SAMtools [43], SOAPsnp [44], SNVMix [45], Varscan [46], etc. Differing in the underlying statistical models, the performances of these methods are comparable and vary on sequencing depths [4749]. Detecting somatic mutation involves mutation calling in paired tumor-normal DNA, coupled with comparison to the reference. A naïve somatic mutation caller applies standard calling tools on the normal and tumor samples separately and then selects mutations detected in tumor but not in normal. Alternatively, a complicated caller jointly analyzes tumor-normal pair data such as Varscan2 [50], Somaticsniper [51] and JointSNVMix [52]. SIFT [53], PolyPhen [54], CHASM [55] and ANNOVAR [56] have been developed to understand the impact of the mutations on gene function and to distinguish between driver and passenger mutations. For WGS, various kinds of structural variations can be discovered using BreakDancer [57], VariationHunter [58], PEMer [59] and SVDetect [60]. RNA-seq data analysis generally includes reads alignment, gene expression quantification, differentially expressed genes/isoforms or alternative splicing detection and novel transcripts discovery (Table 4). There are two major approaches to map RNA-seq reads. One is to align reads to the reference transcriptome using standard DNA-seq reads aligner. The alternative is to map reads to the reference genome allowing for the identification of novel splice junctions using a RNA-seq specific aligner, such as TopHat [61], MapSplice [62], SpliceMap [63], GSNAP [64], and STAR [65]. Having aligned reads, expression values are quantified by aggregating reads into counts and differential expression analysis is performed based on counts (DEseq [66],edgeR [67]) or FPKM/RPKM values (CuffLinks [68, 69]). Estimating isoform-level expression is very difficult since many genes have multiple isoforms and most reads are shared by different isoforms. To deal with read assignment uncertainty, Alexa-seq [70] counts only the reads that map uniquely to a single isoform, while Cufflinks [68, 69] and MISO [71] construct a likelihood model that best explains all the reads obtained in the experiment. In addition, fusion transcripts can be detected using SOAPfusion, TopHat-Fusion [72], BreakFusion [73], FusionHunter [74], deFuse [75], FusionAnalyser [76], etc. To obtain a more complete view of cancer genome, an integrative approach to study diverse mutations, transcriptomes and epigenomes simultaneously on the pathways or networks is much more informative and promising. A growing number of pathway-oriented tools is now becoming available, including PARADIGM [77], NetBox [78], MEMo [79], CONEXIC [80], etc.

Table 3 Computational tools for cancer genomics
Table 4 Computational tools for cancer transcriptomics

Comprehensive cancer projects and resources

The vast amount of oncogenomics data are generated from large scale collaborative cancer projects (Table 5). The Cancer Genome Atlas (TCGA) and International Cancer Genome Consortium (ICGC) are the two largest representatives of such coordinated efforts. Beginning as a three-year pilot in 2006, TCGA aims to comprehensively map the important genomic changes that occur in the major types and subtypes of cancer. TCGA will examine over 11,000 samples for 20 cancer types (http://cancergenome.nih.gov/). ICGC launched in 2008 and its goal is ‘to obtain a comprehensive description of genomic, transcriptomic and epigenomic changes in 50 different tumor types and/or subtypes which are of clinical and societal importance across the globe’(http://icgc.org/icgc). The Cancer Genome Project (CGP) has many efforts at the Sanger Institute and aims to identify sequencevariants/mutations critical in the development of human cancers (http://www.sanger.ac.uk/genetics/CGP/). The NCI’s Cancer Genome Anatomy Project (CGAP) seeks to determine the gene expression profiles of normal, precancer and cancer cells, leading eventually to improved detection, diagnosis and treatment for the patient (http://cgap.nci.nih.gov/). Recently, the Clinical Proteomic Tumor Analysis Consortium (CPTAC) has launched to systematically identify proteins that derive from alterations in cancer genomes using proteomic technologies (http://proteomics.cancer.gov/). The combination of genomic and proteomic initiatives is anticipated to produce a more comprehensive inventory of the detectable proteins in a tumor and advance our understanding of cancer biology.

Table 5 Comprehensive cancer projects and resources

The data and the results from these projects are freely available to the research community (Table 5). A number of databases and frameworks have been developed to make the data and the results easily and directly accessible. For example, the results from CGP are collated and stored in COSMIC [83]. The cBio Cancer Genomics Portal, containing dataset from TCGA and published papers, is specifically designed to interactively explore multidimensional cancer genomics data, including mutation, copy number variations, expression changes (microarray and RNA-seq), DNA methylation values, and protein and phosphoprotein levels [84]. Intogen is also a framework that facilitates the analysis and integration of multimensional data for the identification of genes and biological modules critical in cancer development [85]. The Broad GDAC Firehose, designed to coordinate the various tools utilized by TCGA, provides level 3 and level 4 analyses and enables researchers to easily incorporate TCGA data into their projects. Table 5 also includes resources useful for cancer research but not built on NGS data, e.g., Progenetix [86].

Challenges and perspective

Although NGS has already helped researchers discover a plethora of information in the field of cancer, challenges in translating the large amounts of oncogenomics data into information that can be easily interpretable and accessible for cancer care still lie ahead. From a computational point of view, many technical and statistical issues remain unsolved. For example, repetitive DNA represents a major obstacle for the accuracy of read alignment and assembly, as well as structure variation detection [87]. Furthermore, it is difficult to distinguish rare mutations in tumor from sequencing and alignment artifacts, especially when a tumor has low purity. Despite new methods to comprehensively catalogue genomic variants, the prediction of their functional effect and the identification of disease-causal variants are still in an early phase [88]. Current algorithms for quantifying isoform expression are not computationally trivial and are incredibly difficult to explain. Although the concept of integrative analysis is not new, predictive networks or pathway models that combine various omics data are still underway. Most importantly, since sequencing technologies and methodologies are both evolving rapidly, it is a difficult challenge to store, analyze and present the data in a method that is transparent and reproducible [89]. On the other hand, tumor complexity and heterogeneity make the analysis and the interpretation of sequencing data even harder. Heterogeneity is dynamic and evolves over time. This challenges the simple notion of binning mutations as tumorigenesis ‘driver’ and neutral ‘passenger’, since some passengers are also drivers just waiting for the right context [90].

From a clinical point of view, a major challenge is to assess genomic variants as potential therapeutic targets. Although many diverse variants are demonstrated to converge on similar deregulated pathways, there is still a lack of pathway-targeted therapies. With the discovery of intra-tumor heterogeneity, questions have been raised about how well a glimpse of a tumor’s genomic landscape can steer the treatment. Currently, many clinicians decide a treatment based on the genetic markers from a few biopsies. Whether these markers are over- or under-represented in the tumor is unknown, causing the selection of treatment to be difficult [29]. In addition to heterogeneity, the tumor’s ability to evolve allows it to have more opportunities to adapt and survive to various treatments. Some researchers hope that with current target therapies, intratumor heterogeneity will decrease to a certain point [29] so that clinicians can then target the non-responsive clones before a tumor re-growth and more mutations can occur; however, choosing an appropriate target therapy will be a challenge. A few researchers have already shown certain treatments, such as the cytotoxic therapies, that have increased genome instability and diversity, resulting in a faster tumor evolution rate and, thus, heterogeneity. The fact is that this area of cancer is understudied [26]; however, one of the key challenges researchers must solve is identifying branched subclones are resistant to which target therapies. More knowledge of network medicine and the interaction between the trunk and branch mutations may lead to appropriate target therapies and personalized therapeutic strategies that can prevent drug resistance and effectively eradicate cancer [26, 91].

To accelerate the rate of translating genomic data into clinical practice, a sustained collaboration among multiple centers and effective communication among bioinformaticians, statistical geneticists, molecular biologists and physician are required. Bioinformaticians and statistical geneticists are responsible for providing reproducible and accurate analysis, identifying ‘drivers’ in the unstable and evolving cancer genome and building powerful and flexible integrative model to consider interactions among genomic, transcriptomic, metabolomics, proteomics and epigenomic alterations in the context of tumor microenvironment. Biologists interpret and confirm the functional relevance of variants to cancer. Physicians assess relationships of variants to cancer prognosis and response to therapy. Appropriate infrastructure within each research institution that integrates the clinic for patient samples, wet lab for sequencing, and Bioinformatics for data analysis should allow the sequenced data to be processed efficiently, producing results that can create effective personalized therapies applicable to the clinic. In addition, easily accessible and understandable databases that connect genomic findings with clinical outcome are also required. With these efforts and developments, NGS will greatly potentiate genome-based cancer diagnosis and personalized treatment strategies.