Copy number variations: dynamic genomes

Our genomes are not the stable places we once thought they were. Recent genome-wide studies have shed light on copy number variations (CNVs), an unexpectedly frequent, dynamic and complex form of genetic diversity, and have quickly overturned the idea of a single diploid human 'reference genome'. Although the characterization of the extent and location of these regions in healthy genomes is far from complete, many groups, including ours, are actively trying to determine the clinical impact of CNVs in patient populations.

CNVs are structurally variant regions in which copy number differences have been observed between two or more genomes [1]. Defined as being larger than 1 kilobase (kb) in size, CNVs can involve gains or losses of genomic DNA that are either microscopic or submicroscopic and are, therefore, not necessarily visible by standard G-banding karyotyping. Until recently, only a few copy-number-variable loci had been identified, such as duplications at the α7-nicotinic receptor gene (CHRNA7) at 15q13-15 [2] and variation at the major histocompatibility complex locus [3]. In 2004, significant advances in DNA array technology enabled the discovery of many CNVs, revealing a novel and pervasive form of inter-individual genomic variation [4, 5]. These pioneering genome-scale efforts used two different platforms to find 76 CNVs in 20 individuals [5] and 255 CNVs in 55 individuals [4], some of which were common to both studies, suggesting possible hotspot regions of CNVs in the human genome. Even this was soon found to be an under-representation of the number of CNVs; follow-up studies have since ascertained many thousands of CNV regions in hundreds of healthy individuals. In fact, the recent increase in scientific interest in CNVs, combined with improvements in microarray fabrication (higher density at lower cost) and the development of new informatics techniques, has led to the ascertainment of approximately 21,000 CNVs, or around 6,500 unique CNV loci, in the five short years since this form of genetic variation was first revealed (these figures come from the March 2009 update of the Database of Genomic Variants (DGV) [4]). CNVs are now thought to cover at least 10% of the human genome. Furthermore, next-generation sequencing technologies will soon be used to sequence thousands of genomes along with their CNVs.

Figure 1

Distribution of common cancer CNVs in the human genome. The chromosomes containing common cancer CNVs in the human genome are shown, with centromeric regions in red (using data from [19]) and Giemsa banding patterns in white, grey or black. Loci are in green if they were found to contain a cancer-related gene that is overlapped or encompassed by a CNV (as found by [18]).

CNVs and disease: mutable genomes

The CNV map for the human genome is being continuously refined and has already pinpointed the location, copy number, gene content, frequency and approximate breakpoints of numerous CNVs in the healthy population. These structural variants can alter transcription of genes by altering dosage or by disrupting proximal or distant regulatory regions, as has been shown globally in the healthy human [6], mouse [7] and rat genomes [8]. It is, however, the specific disease-associated CNV loci that have been particularly scrutinized and that therefore provide the most detailed examples of how CNVs can alter cellular function. We will highlight three insights in particular from the literature: that pathogenic CNVs often contain multiple genes, that the effect of a pathogenic CNV is not limited to the gene(s) it contains, and that pathogenic CNVs can have reciprocal deletions/duplications.

The number of genes in pathogenic CNVs

Genomic rearrangements give rise to a variety of diseases classified as 'genomic disorders' [9]. Because they involve large regions, it is common for genomic disorders to include many deleted or duplicated genes, unlike traditional mutations, which typically alter the coding region of a single gene. These genes can be either fully encompassed or partially overlapped by the pathogenic CNV. Deletions of 22q11.2 are associated with DiGeorge/velocardiofacial syndrome and include the catechol-O-methyltransferase gene, the T box transcription factor 1 gene and others [10]. Similarly, the autosomal dominant Prader-Willi syndrome (15q11-q13 deletion) involves many genes [11], and the Williams-Beuren syndrome (7q11.23 deletion) involves 28 genes [12]. As microarray resolution increases, genomic disorders are likely to be found that are caused by small CNVs involving only a single gene, or even a portion of one gene.

The source of the effect of a pathogenic CNV

Usually, the genes contained in the pathogenic CNV are candidates for association with the clinical phenotype under study. However, research on genomic disorders has shown that some genes within a CNV may not be necessary, or may not be sufficient, to cause the observed disease. For example, a recurrent 3.7 Mb microdeletion is responsible for 70% of cases of Smith-Magenis syndrome (SMS) [13], a neurobehavioral disorder involving sleep disturbance, craniofacial and skeletal anomalies, intellectual disability and distinctive behavioral traits. Although the size of the deletions observed varies, the identification of a common 'critical region' (1.5 Mb) in SMS patients led to the conclusion that the retinoic acid induced 1 (RAI1) gene alone is responsible for most SMS features. Indeed, RAI1 point mutations have been seen in patients who have similar phenotypes but no deletion, thus confirming that disruption of this gene alone (of the 13 in the critical region) is sufficient to cause SMS. Patients with additional genes deleted have a variable and more severe phenotype. In contrast, in Williams-Beuren syndrome, not only the aneuploid genes but also genes far outside the deleted region have reduced expression and are thought to contribute to the phenotype [14]. Such long-range influence of CNVs on distant gene expression is proposed to be caused by positional effects [15].

Reciprocal deletions and duplications

Recombination between highly homologous sequences (non-allelic homologous recombination) can generate deletions, duplications, inversions and translocations. The sequence architecture that allows one copy number change can also allow its reciprocal at the same locus. The reciprocal events usually cause different phenotypes and occur at different frequencies in the population and at different rates during meiosis [16].

CNVs and cancer predisposition: first hits to the tumor genome

A central goal of cancer genetics is to discover all variant alleles that predispose to neoplasms. To this end, single nucleotide polymorphisms (SNPs) have been the most widely studied form of genetic variation and, by using massive whole-genome studies (genome-wide association (GWA) studies), many common SNPs have been shown to be associated with cancer and other complex traits. However, the results of these efforts have not explained much of the heritability of disease [17]. This is perhaps because GWA studies have mostly ignored the inter-individual genetic variation provided by CNVs, which affect more than 10% of the human genome. CNVs, especially smaller variants, have been essentially hidden from view until recently; thus, only a handful of studies have found an association of CNVs with cancer. As more CNVs are identified, it is reasonable to expect that they will explain a larger portion of the genetic basis of cancer. Common and rare CNVs should be considered separately, as they may have very different roles in cancer.

Common cancer CNVs

As with SNPs, CNVs that are found frequently in the healthy population (common CNVs) are likely to have a role in cancer etiology. In the only study published so far that begins to test the hypothesis that common CNVs are associated with malignancy, we [18] created a map of every known CNV whose locus coincides with that of bona fide cancer-related genes (as catalogued by [19]); we called these cancer CNVs. In an initial analysis [18], we examined 770 healthy genomes using the Affymetrix 500 K array set, which has an average inter-probe distance of 5.8 kb. As CNVs are generally thought to be depleted in gene regions [20], it was surprising to find 49 cancer genes that were directly encompassed or overlapped by a CNV in more than one person in a large reference population (Figure 1). In the top ten genes, cancer CNVs could be found in four or more people. In this analysis only CNVs directly overlapping a cancer gene were selected: either both breakpoints were inside the genomic interval containing the gene, both were outside the interval and flanked it (so that the CNV encompassed the whole gene), or one breakpoint was inside while the other was outside. However, this is probably an underestimate of the actual number of common cancer CNVs, for two reasons. First, many smaller variants are missed at the resolution of this array: the mean size of CNVs found using the Affymetrix 500 K array is 206 kb [20], whereas the CNVs found using the newer Affymetrix 6.0 platform, with a median inter-marker distance of less than 700 bp, are 5-15 times smaller [21]. Second, as discussed above, there are probably additional, more distal CNVs that have a long-range effect on cancer gene transcription levels.
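The three breakpoint configurations used to select cancer CNVs can be sketched in a few lines of Python. This is an illustration only, not the pipeline used in [18]: the `Interval` type, the `classify_overlap` function, the category labels and all coordinates are hypothetical, and positions are assumed to be 1-based and inclusive on a single reference build.

```python
from typing import NamedTuple, Optional


class Interval(NamedTuple):
    """A genomic interval; start and end are assumed 1-based and inclusive."""
    chrom: str
    start: int
    end: int


def classify_overlap(cnv: Interval, gene: Interval) -> Optional[str]:
    """Return how a CNV overlaps a gene, or None if there is no overlap.

    Labels (hypothetical names, chosen for this sketch):
      within_gene      - both CNV breakpoints lie inside the gene interval
      encompasses_gene - both breakpoints lie outside and flank the gene
      partial_overlap  - one breakpoint inside the gene, one outside
    """
    # No overlap: different chromosome, or the CNV lies wholly to one side.
    if cnv.chrom != gene.chrom or cnv.end < gene.start or cnv.start > gene.end:
        return None

    start_inside = gene.start <= cnv.start <= gene.end
    end_inside = gene.start <= cnv.end <= gene.end

    if start_inside and end_inside:
        return "within_gene"
    if not start_inside and not end_inside:
        # The intervals overlap but neither breakpoint is inside the gene,
        # so the CNV must span the gene completely.
        return "encompasses_gene"
    return "partial_overlap"


if __name__ == "__main__":
    gene = Interval("chr6", 1_000, 2_000)      # made-up gene coordinates
    for cnv in (
        Interval("chr6", 1_200, 1_500),        # intragenic CNV
        Interval("chr6", 500, 2_500),          # CNV spanning the gene
        Interval("chr6", 1_500, 2_500),        # one breakpoint inside
        Interval("chr6", 2_100, 2_500),        # nearby but not overlapping
    ):
        print(cnv, classify_overlap(cnv, gene))
```

A CNV that returns `None` here would not have been counted in the analysis described above, which is one reason the reported number of cancer CNVs is likely to be conservative: distal variants with possible long-range effects are excluded by construction.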

Validating the initial observation [18], many of these genes are also found in the DGV, a curated list of CNVs compiled from numerous publications [4]. Analysis of the DGV [22] shows that nearly 40% of cancer-related genes are interrupted by a CNV. This trend continues: even among the ten most recent CNV publications in the DGV (those published after February 2008), many important tumor suppressor genes and oncogenes can be found with diverse functions, including apoptosis, control of cell cycle checkpoints and DNA repair, and numerous translocation and fusion gene partners. An example of this is RAD51L1, a member of the RAD51 gene family; this family is essential for DNA repair by homologous recombination, and RAD51L1 has been shown by a GWA study to contain a SNP that is strongly associated with breast cancer [23].

The challenge will be to determine which of these genes are dosage-sensitive and which tissues containing these common cancer CNVs will be susceptible to malignant transformation and growth. One approach is to characterize specific cancer CNVs in great detail, in terms of both population frequency and breakpoint sequence [24]. For example, in a pilot candidate-gene association study, we found a cancer CNV at the gene MLLT4 (a Ras target encoding a protein that regulates cell-cell adhesion) that seems to be associated with the Li-Fraumeni cancer predisposition disorder (LFS); individuals affected with LFS harbor a germline heterozygous mutation of the TP53 tumor suppressor gene [18]. The frequency of this CNV is significantly increased in LFS (P = 0.006, Fisher's exact test): 3 of the 19 LFS probands (15.8%; observed/expected = 3/0.4 = 7.5) harbored the CNV duplication, whereas only 12 of 710 healthy individuals from the reference population (1.69%; observed/expected = 12/14.6 = 0.82) harbored the CNV.
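The arithmetic behind these figures can be reproduced with a short Python sketch. It assumes the 2 x 2 table implied by the text (3 carriers and 16 non-carriers among LFS probands; 12 carriers and 698 non-carriers among healthy individuals) and a two-sided Fisher's exact test computed directly from the hypergeometric distribution; the function and variable names are ours, and the original analysis may have been run with different software or options.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more probable than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def pmf(k: int) -> float:
        # Probability of k carriers in row 1 given fixed margins.
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = pmf(a)
    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(k_min, k_max + 1)
               if pmf(k) <= p_obs * (1 + 1e-9))


# Counts taken from the text above.
lfs_carriers, lfs_total = 3, 19
ctrl_carriers, ctrl_total = 12, 710

total = lfs_total + ctrl_total
carriers = lfs_carriers + ctrl_carriers

# Expected carriers in each group if the CNV were equally frequent in both.
expected_lfs = lfs_total * carriers / total
expected_ctrl = ctrl_total * carriers / total

p_value = fisher_exact_two_sided(
    lfs_carriers, lfs_total - lfs_carriers,
    ctrl_carriers, ctrl_total - ctrl_carriers,
)

if __name__ == "__main__":
    print(f"expected carriers, LFS:     {expected_lfs:.2f}")
    print(f"expected carriers, healthy: {expected_ctrl:.2f}")
    print(f"obs/exp, LFS:     {lfs_carriers / expected_lfs:.2f}")
    print(f"obs/exp, healthy: {ctrl_carriers / expected_ctrl:.2f}")
    print(f"Fisher's exact P: {p_value:.4f}")
```

Note that the text rounds the expected number of LFS carriers to 0.4 before dividing, giving 7.5; with the unrounded expectation the observed/expected ratio is slightly higher.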

A good illustration of a focal CNV with phenotypic effect is given by the mitochondrial tumor suppressor gene (MTUS1); Frank et al. [25] found that a small deletion in MTUS1 is associated with a decreased risk of familial and high-risk breast cancer. Using long-range PCR, we independently fine-mapped this common cancer CNV and genotyped it in a panel of healthy controls. Although it is only 1.1 kb in size, the deletion removes an entire exon of MTUS1. Direct sequencing revealed a 41 bp stretch of homology flanking the exon, which leads to this deletion by non-allelic homologous recombination (Figure 2).

Figure 2

Cancer CNV breakpoint mapping. We mapped a 1.1 kb deletion in the mitochondrial tumor suppressor gene, MTUS1, to base-pair resolution. The affected portion of the gene is shown, including an exon (blue) that is deleted in the presence of the CNV. Two 41 bp repeats (with sequence AAATAAGAACCAAGTCCAAATACATCTTTGGAATGAAAGAG) were found at the breakpoints (red), while the sequence of the junction fragment is shown in the chromatogram.
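The way two direct repeats can mediate loss of the intervening sequence can be illustrated with a toy Python example. The 41 bp repeat is the sequence given in the Figure 2 legend; everything else (the flanking bases, the spacer bases, the exon placeholder and the function names) is invented for the sketch and does not represent the real MTUS1 locus or its 1.1 kb deletion.

```python
# 41 bp repeat reported at the MTUS1 breakpoints (Figure 2 legend).
REPEAT = "AAATAAGAACCAAGTCCAAATACATCTTTGGAATGAAAGAG"

# Hypothetical placeholder sequences, for illustration only.
LEFT_FLANK = "CCGTA"
EXON_PLACEHOLDER = "ATGGCGTTTCCAGGA"
RIGHT_FLANK = "TTGCA"

# Toy reference allele: repeat - spacer - exon - spacer - repeat.
REFERENCE = (LEFT_FLANK + REPEAT + "ACGT" + EXON_PLACEHOLDER
             + "TGCA" + REPEAT + RIGHT_FLANK)


def find_all(seq: str, motif: str) -> list[int]:
    """Return the 0-based start positions of every occurrence of motif."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def delete_between_repeats(seq: str, motif: str) -> tuple[str, int]:
    """Model a deletion mediated by the first two direct copies of motif.

    One repeat copy and everything between the two copies is removed,
    leaving a single copy at the junction. Returns the deletion allele
    and the number of bases removed.
    """
    hits = find_all(seq, motif)
    if len(hits) < 2:
        raise ValueError("need at least two copies of the repeat")
    first, second = hits[0], hits[1]
    return seq[:first] + seq[second:], second - first


if __name__ == "__main__":
    deleted_allele, n_removed = delete_between_repeats(REFERENCE, REPEAT)
    print("repeat length:", len(REPEAT))
    print("bases removed:", n_removed)
    print("exon retained:", EXON_PLACEHOLDER in deleted_allele)
```

In the deletion allele a single copy of the repeat remains at the junction, which is why sequencing across the junction fragment cannot place the breakpoint more precisely than somewhere within the 41 bp of shared homology.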

These examples demonstrate hypothesis-driven approaches, which are restricted to genes for which there is an a priori association with cancer. Ultimately, it will be important to be able to discover and test every CNV in a genome for cancer susceptibility. Although this hypothesis-free approach is becoming technically tractable and more economical, such studies have unique analytical challenges. As elaborated upon elsewhere [26, 27], these challenges include: the unknown allele frequency and integer copy number of most CNVs, both within and among populations; the absence of sequence-level breakpoint information for most CNVs; and the architectural complexity of some CNV regions, including smaller CNVs within larger ones [24].

Rare cancer CNVs

Common cancer SNPs - and by analogy common cancer CNVs - each confer only a minor increase in disease risk, but collectively they may cause a substantially elevated risk. In contrast, the mutations associated with hereditary cancer syndromes are frequently highly penetrant on their own and are usually inherited in an autosomal dominant manner. Unlike low-penetrance alleles, rare high-penetrance mutations will almost always co-segregate with the disease in families.

There are over 200 cancer syndromes and although most arise infrequently, they account for 5-10% of all cancer cases [28]. These are caused by base-pair-sized germline mutations in many central tumor suppressor genes - such as TP53, APC, BRCA1, BRCA2, PTEN, and RB1 - and (fewer) oncogenes, including HRAS and RET.

The role of large structural mutations in cancer syndromes has been less appreciated, probably because genomic deletions or duplications are not readily detected by PCR-based sequencing. New multiplexing methods, especially multiplex ligation-dependent probe amplification (MLPA) [29], allow targeted copy number assessment of single gene or exon changes. This has led to a recent upsurge in discoveries of patients and families with rare pathogenic CNVs that strongly predispose to cancer. Of the 70 germline cancer genes in the Cancer Gene Census [30], 28 have been reported to be mutated by genomic deletion or duplication (the genes and citations are shown in Table 1). We hypothesize that many of the remaining gene mutations will be found to have a genomic equivalent and, perhaps more importantly, that predisposing CNVs will be found in other regions not usually associated with hereditary cancer. A recent report by Jackson et al. [31] describing five patients with rhabdoid predisposition syndrome and deletions at SMARCB1 (22q11.2) highlights the benefits of a global approach to CNV detection. By using SNP arrays to gain a broad perspective on the SMARCB1 deletion and the surrounding chromosomal landscape, the authors found that two patients' deletions in fact extended beyond SMARCB1, impinging on neighboring genes and explaining their clinical phenotype.

Table 1 Rare cancer CNVs at known cancer-predisposing genes

The presence of rare cancer CNVs leads to many questions: do they differ from base-pair changes at the same locus? What is their penetrance? What are the mutational processes that give rise to them? Do they have reciprocal deletions/duplications? Do they have long-range effects on gene expression? These questions provide fertile ground for future research. Such studies may involve identifying novel CNVs in unexplained familial clusterings of cancer, or using in vitro models in which cancer CNVs are created to measure their effect on cellular proliferation, genomic instability and the other hallmarks of cancer [32].

One potential model to explain the contribution of common and rare CNVs to cancer predisposition is shown in Figure 3. We propose that the number of copy-number-variable regions in healthy persons is kept low by efficient DNA repair, whereas CNVs are more abundant in cancer-prone individuals because of germline defects in these repair processes. Although tumors are known to have increased somatic copy number change and genomic instability, our model suggests that these alterations arise much earlier in cancer-predisposed individuals.

Figure 3

Proposed model for CNVs in tumorigenesis. A model of copy-number-variable DNA regions in patients with sporadic (top) or inherited (bottom) cancer. We propose that healthy people maintain a similar low number of CNVs in their genomes (left; black blocks indicate inherited CNVs), whereas those at risk of developing early onset cancer have an excess of CNVs and a greater overall genomic burden of copy-number-variable DNA (middle; red blocks indicate somatically acquired CNVs). As a tumor grows, it acquires more copy-number-variable regions, including tumor-specific regions (blue). Reproduced with permission from [18], copyright (2008) National Academy of Sciences, USA.

CNVs and tumor genomes

So far we have focused here on CNVs and cancer predisposition, but similar high-resolution approaches have also driven recent studies on acquired (somatic) copy number alterations (CNAs) in tumor DNA.

Copy number alterations

Genome-scale analyses have found many formerly invisible CNAs. In an analysis of 371 lung adenocarcinoma samples using a 250,000 probe array, Weir et al. [33] identified seven recurrent homozygous deletions and 24 recurrent amplifications. The most significant amplification, at 14q13.3 and containing the novel oncogene NKX2-1, had not been found in previous studies; because of insufficient resolution and sample size, the target gene it contained had not been identified. Using an even denser array, Mullighan et al. [34] profiled the DNA copy number changes of 242 pediatric acute lymphoblastic leukemia (ALL) patients, including 192 with B-progenitor leukemia (B-ALL) and 50 with T-lineage leukemia (T-ALL). Global differences between the subtypes' genomes and recurrent abnormalities at specific loci were identified. An average of six CNAs was found per leukemia genome, but significant differences in the number of CNAs were found within the B-ALL group and between the B-ALL and T-ALL subtypes. Intriguingly, in 30% of B-ALL patients, the authors [34] detected deletions of PAX5, which encodes a transcription factor that is expressed during early stages of B-cell development. Using CNA analysis to pinpoint critical genes can also help to plan subsequent sequencing efforts. For example, having identified deletions at PAX5, the authors [34] found that an additional 14 patients had point mutations in the same gene.

Using CNAs to define the key pathways of a tumor

In glioblastoma, CNA information, mRNA expression levels and methylation changes have been measured and nucleotide mutational analyses have been carried out [35]. Integrative analysis has shown that over 70% of tumors carry alterations in the retinoblastoma, p53 and receptor tyrosine kinase pathways. Although cancer is driven primarily by alterations of the genome, this study [35] and others have shown that CNA profiles can be combined with other high-throughput data to create insights that are 'greater than the sum of their parts'.

Conclusions and perspectives

The study of cancer and CNVs is in its infancy but is maturing quickly. In considering the effect of this form of genetic variation on cancer predisposition, cancer gene expression and tumor genome profiling, there is much to learn from past studies on genomic disorders. Denser micro-arrays, next-generation sequencing and integrative informatics analyses are around the corner and promise to uncover new CNVs and CNAs.

There are, therefore, many exciting questions to be addressed: what role do CNVs have in cancer predisposition and how can we use this newly discovered form of genetic variation to identify those most at risk? Which cancer-related genes are affected by CNVs and, of these changes, which are both necessary and sufficient to cause neoplastic growth? Can incipient cancer cells use these constitutional deletions and duplications to induce or accelerate tumorigenesis and tumor proliferation? As these questions are resolved, the potential value of cancer CNVs as novel biomarkers of cancer susceptibility and initiation, and of cancer progression and metastases, will become apparent. Whether cancer CNVs offer insight into genes that might be targets for novel drug development remains to be determined.