Introduction

For more than a century, it has been possible to image and classify chromosomes using microscopy. As karyotyping techniques were refined, changes in chromosome number as well as structure were recognized and it became possible to correlate these changes in genomic composition and architecture to disease. In the last 5–10 years, the advent of technologies such as array comparative genomic hybridization (aCGH) and next-generation sequencing (NGS) has allowed finer resolution of structural variants (SVs) ranging in size from entire chromosomal arms to a single base pair and has demonstrated that SV frequencies can range from extremely rare events to common population polymorphisms. In this review, we examine the causes, repair mechanisms, and errors that result in SV as well as some of the biological factors that play a role in the development of SV within the human genome.

Causes of SVs

DNA damage is a common insult to cells. It is estimated that the genome in every cell of the human body normally undergoes as many as 10,000 lesions per day [1–3]. In contrast to many forms of single nucleotide lesions, SVs result from improper repair of a double-strand break (DSB), a lesion in which both strands of the phosphodiester backbone of DNA are broken, with frequencies of up to 50 DSBs per cell cycle (Fig. 1a) [4]. DSBs are caused by factors that arise from internal cellular function (i.e., endogenous) and environmental insults (i.e., exogenous). Endogenous factors include reactive oxygen species, improperly repaired single nucleotide lesions, unrepaired single strand breaks, DNA replication stress, and self-induced DSBs in meiotic and lymphoid cells [5–12]. Exogenous DNA damage can be caused by chemical mutagens, which bind to or directly change the DNA structure, and by ionizing radiation (i.e., high energy photons including X-rays and gamma rays), which can break nucleotide bonds or produce nearby free radicals [13–16]. In addition to these insults, DSBs can also occur as a result of transposable elements within the genomic DNA [17]. If not repaired expeditiously, DSBs can quickly lead to destabilization of the genome and result in catastrophic events for the cell and organism [18, 19].

Fig. 1

Schematic of mechanisms involved in SVs. a Causes of double-strand breaks (DSBs) include endogenous, exogenous, and programmed sources, and result in breakage of both phosphodiester backbones of the DNA helix. b DSB repair (DSBR) occurs through two major pathways: 1) non-homologous end joining (NHEJ), which occurs throughout the cell cycle; or 2) homologous recombination (HR), which occurs primarily in S and G2 phases of the cell cycle. In NHEJ, either classical NHEJ (c-NHEJ), where no resectioning has occurred, or alternative NHEJ (a-NHEJ), where minor resectioning has occurred, is employed. HR requires major resectioning and then can repair through single strand annealing (SSA) or form a D-loop, which is rectified either by synthesis-dependent strand annealing (SDSA) or by the canonical HR pathway, which forms a double Holliday junction (dHJ) that can dissolve as a non-crossover (NCO) event or resolve through cleavage in either an NCO or a crossover (CO) event. c DSBR can result in normal repair, where the original chromosomal architecture is intact, or in structural variants (SVs) that can have a combination of losses, gains, and/or rearrangements

Repair of DSBs

Eukaryotic cells have developed at least two types of repair pathways to resolve DSBs: homologous recombination (HR) and nonhomologous end joining (NHEJ) (Fig. 1b). HR is the primordial mechanism of double-strand break repair (DSBR), first discovered as part of meiotic crossover [20]. Homologous regions of the genome (e.g., homologous chromosomes and sister chromatids) serve as a template to prime HR, which can result in error-free replication of the original DNA sequence. In addition to meiotic crossover events in gametes, HR is most prevalent between S and M phases (when sister chromatids are available as a repair template) in eukaryotic diploid cells [21]. DSBR also occurs in the complete absence of homology in a process known as NHEJ. In NHEJ, the exposed DNA ends created by the DSB are directly religated to each other. Because NHEJ does not rely on a homologous chromosome as a template, it can occur during any phase of the cell cycle. Problematically, NHEJ ligation can occur between any two exposed DNA ends, whether or not they result from the same DSB and whether or not any nuclease activity has occurred [22]. This lack of specificity can result in a loss of normal genome structure, which is why NHEJ is regarded as an error-prone mechanism. Despite the nonspecificity of DNA end joining, NHEJ is the predominant DSBR mechanism in higher order mammals [23]. While most DSBR mechanisms fall under these two headings, HR and NHEJ pathways differ among species and between cells. Starting with NHEJ as the default mechanism in mammals, a brief description of the DSBR molecular pathways is provided below (see also Table 1) [23]. Excellent in-depth descriptions of proteins involved in DSBR can also be found in several recent reviews [24••, 25••, 26].
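To illustrate why end joining without a template becomes risky once more than one DSB is present, the following minimal Python sketch enumerates every way a set of free DNA ends could be ligated in pairs. It is a counting exercise only: the function name `end_pairings` and the end labels are hypothetical, and the sketch does not assume that all pairings are equally likely in a cell.

```python
def end_pairings(ends):
    """Yield every way to ligate the given free DNA ends in pairs."""
    if not ends:
        yield []
        return
    first, rest = ends[0], ends[1:]
    for k, partner in enumerate(rest):
        remaining = rest[:k] + rest[k + 1:]
        for tail in end_pairings(remaining):
            yield [(first, partner)] + tail


def is_faithful(pairing):
    """True when every ligation rejoins the two ends of the same break."""
    return all(x.split("_")[0] == y.split("_")[0] for x, y in pairing)


# Two hypothetical DSBs (A and B), each exposing a left and a right end.
two_breaks = ["A_left", "A_right", "B_left", "B_right"]
pairings = list(end_pairings(two_breaks))
faithful = [p for p in pairings if is_faithful(p)]
print(len(pairings), "possible pairings,", len(faithful), "restores both breaks")
```

With two breaks, only one of the three possible pairings rejoins each break to itself; the other two join ends from different breaks and would correspond to rearrangements.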

Table 1 Essential proteins involved in double-strand break repair

Nonhomologous End Joining (NHEJ)

c-NHEJ

After a DSB in mammals, NHEJ is often the initial DSBR mechanism due in part to its fast kinetics [26]. In the canonical, or classic, form of NHEJ (c-NHEJ), DNA ends created by the DSB are bound and processed by heterodimers of the Ku proteins (Ku70 and Ku80 in humans) [27]. The Ku heterodimer binds to and activates DNA–PKcs to form the DNA–PK complex. DNA–PK is able to bind across the gap between DNA ends, tethering them together and forming the synaptic complex [28]. DNA–PK can recruit ARTEMIS, an enzyme with both exo- and endonuclease activity, to process non-ligatable DNA ends [29, 30]. Processed DNA ends are then phosphorylated by DNA–PK, while the synaptic complex protects against additional nuclease activity [31, 32]. Religation of the DNA ends is performed by the complex of the DNA repair protein XRCC4 and DNA ligase IV (XRCC4/LIG4), the activity of which is promoted by XLF to repair the DSB [33–35].

a-NHEJ

When c-NHEJ is unable to process the DNA ends for ligation or Ku binding is inhibited, an alternate form of NHEJ (a-NHEJ, also known as microhomology-mediated end joining or MMEJ) can be employed. a-NHEJ is initiated by PARP1, which can also bind directly to DNA ends and compete with Ku proteins [36]. PARP1 is able to modify histones to create a favorable repair environment as well as recruit the MRN complex (Mre11, Rad50, Nbs1), which in turn displaces the PARPs on the DNA ends [37]. MRN serves to tether and process the DNA ends and recruit additional DNA repair enzymes including CtIP (also known as RBBP8) [38]. MRN/CtIP performs exonuclease activity to resect the 5′ strand resulting in an exposed 3′ single stranded DNA (ssDNA) overhang [39, 40]. In a-NHEJ, resection of DSB ends is limited and followed by annealing of the resected ssDNA to each other through microhomology sequences (5–25 nucleotides) [39]. Ligation is then performed in a process similar to single strand break repair mechanisms using the DNA repair protein XRCC1 and DNA ligase III (XRCC1/LIG3) complex [41]. Nucleotides 3′ to the microhomology sequence are removed in this process resulting in the loss of several base pairs at the initial DSB.
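The annealing and trimming steps described above can be sketched in a few lines of Python. This is a simplified string model, not a description of the enzymology: the function name `mmej_join` and the example sequences are hypothetical, both ends are written as top-strand sequence, and the 5–25 nucleotide window is taken from the text. The sketch searches for the longest microhomology shared by the two ends, prefers the copy nearest the break on the left end, and removes the flaps and one copy of the microhomology.

```python
def mmej_join(left, right, min_len=5, max_len=25):
    """Join two broken ends through a shared microhomology (toy model).

    left  : top-strand sequence ending at the break (5'->3')
    right : top-strand sequence starting at the break (5'->3')
    Returns (repaired_sequence, microhomology, bases_lost), or None when
    no microhomology of at least min_len nucleotides is shared.
    """
    longest = min(max_len, len(left), len(right))
    for k in range(longest, min_len - 1, -1):
        # Scan from the break inward so sites nearest the break win.
        for i in range(len(left) - k, -1, -1):
            motif = left[i:i + k]
            j = right.find(motif)
            if j != -1:
                # Keep one copy of the microhomology; the flaps between
                # the two copies are trimmed away before ligation.
                repaired = left[:i + k] + right[j + k:]
                bases_lost = len(left) + len(right) - len(repaired)
                return repaired, motif, bases_lost
    return None


# Hypothetical ends sharing the 5-nt microhomology CCGGT.
result = mmej_join("ATATACCGGTGAG", "TCTCCGGTCACAC")
print(result)
```

In this example the repaired product lacks both flaps and one copy of the microhomology, which mirrors the loss of several base pairs at the break that a-NHEJ is expected to produce.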

Homologous Recombination (HR)

Initiation

The HR pathway is initiated after major resectioning of DNA ends to expose ssDNA. Initial binding of DNA ends can occur by PARPs, as in a-NHEJ, or by direct binding of the MRN complex [42]. MRN again binds CtIP and this complex promotes nuclease activity [43]. In S and G2 phases, BRCA1 binds to CtIP and may participate in end resectioning, and when combined with other nucleases like EXO1 leads to more extensive resection [39, 44–47]. The extended ssDNA produced by resection is stabilized by RPA to prevent nuclease activity on the exposed 3′ strand [48].

Single Strand Annealing (SSA)

SSA can occur in areas of the genome where DNA repeat sequences (e.g., tandem repeats, interspersed repetitive DNA) are highly concentrated. Rather than a sister chromatid or homologous chromosome, repeat sequences in the ssDNA serve as a template for HR. Rad52 propagates annealing between the 3′ ssDNA [49–51]. After annealing, a complex of ERCC1 and XPF binds to Rad52 and cleaves nucleotides 3′ to the repetitive sequences in a process similar to nucleotide excision repair [52, 53].

Displacement Loop (D-loop)

In lieu of SSA, BRCA2 facilitates Rad51 binding to a 3′ ssDNA overhang, displacing RPA and forming a nucleoprotein filament [54, 55]. The Rad51 filament facilitates invasion and annealing of the 3′ ssDNA overhang to homologous sequences and forms a structure known as the D-loop [56]. Once the D-loop has formed, template-based DNA synthesis can be initiated by DNA polymerase [57]. The D-loop can then be resolved through several different paths [56].

Synthesis-Dependent Strand Annealing (SDSA)

In SDSA, the D-loop collapses and the newly synthesized 3′ DNA strand is annealed back to the original DSB. This is mediated by RTEL1, a helicase, which promotes displacement of the newly synthesized 3′ DNA strand [58]. Once released, the newly synthesized strand anneals back to the reverse complementary exposed ssDNA created from the resection of the original DSB, repairing one side. The newly synthesized strand can then serve as the template for polymerase-based synthesis on the complementary strand. SDSA does not result in a chromosomal crossover (i.e., exchange of genetic material); however, since the homologous sequence served as a template for synthesis, the sequence at the original DSB will be converted to the homologous sequence, a process known as gene conversion (GC) [59, 60].

Canonical Homologous Recombination (c-HR)

In the c-HR pathway, the newly synthesized DNA is not immediately released. Instead, the complementary 3′ ssDNA also invades the D-loop and begins template-based synthesis on the complementary strands of the homologous sequences, forming a structure known as the double Holliday junction (dHJ). Both synthesizing strands ligate back to the resected 5′ DNA ends from the initial DSB [61]. The dHJ can either resolve in a dissolution state, where the dHJ unwinds and maintains the original chromosomal architecture, or in a resolution state, where the ends of the dHJ are cleaved and genetic material is transferred between the homologous sequences. Dissolution is mediated by a complex of BLM and TOPOIIIα proteins and can result in GC without exchange of flanking sequences [62]. Resolution of the dHJ does result in exchange of flanking materials and is performed by endonucleases such as MUS81, EXO1, EME1, and SLX4/SLX1. Resolution can result in GC and both crossover and non-crossover events [63].

Break-Induced Replication (BIR)

BIR is a form of DSBR most often associated with replication stress and shares many features with HR. In BIR, only one end of the DSB is involved. As in HR, the 3′ ssDNA overhang invades the homologous sequence, forming a D-loop, and primes DNA synthesis by DNA polymerase. The D-loop then migrates along the chromosome as the 3′ end extends. DNA synthesis in the opposite direction can then occur either on the newly synthesized leading strand or on the complementary ssDNA exposed by the D-loop. BIR, and a form of BIR that relies on microhomology (MMBIR), are often associated with breaks that occur during DNA replication in S phase. The DNA replication fork can stall due to such factors as errant nucleotide repair, single strand breaks, and helicase inhibition. Replication is restarted by inducing a DSB that is repaired by BIR [64–66]. A similar model, known as fork stalling and template switching (FoSTeS), also proposes a mechanism for restarting replication forks, but without a DSB [67, 68].

Control of DSBR

Coordination of DSBR is essential for proper repair and prevention of further damage. After a DSB, the cell delays progression through the cell cycle to allow time for DSBR or, if the damage is too extensive, signals apoptosis. Two master control kinases are employed to organize these events: ATM and ATR. ATM can activate both p53, a pro-apoptotic factor, and CHK2, which delays cell cycle progression. ATR delays cell cycle progression through the CHK1 cascade. Both ATM and ATR have extensive phosphorylation cascades involving essential DSBR proteins including ARTEMIS, BRCA1, BRCA2, and MRN subunits [69, 70]. ATM and ATR, as well as DNA–PKcs, can also phosphorylate H2AX, converting it to its active state γ-H2AX. γ-H2AX serves as a focal point in DSBR through chromatin remodeling and as a platform to assemble other enzymes involved in DSBR [71]. ATM is one of the first proteins activated in the DSBR pathway and, though the initial activation step is still unclear, ATM is recruited and activated through the Nbs1 subunit of the MRN complex during early stages of a-NHEJ and HR [69]. ATR is activated later through an interaction with RPA proteins associated with ssDNA [70]. Many other proteins are involved in the control of DSBR, depending on cell stage, age, and type, and many of the proteins involved in these pathways are still being elucidated.

SVs Resulting from DSBs

For the most part, DSBR results in faithful reproduction of the original genetic architecture before the DSB. In some cases, errors do occur in DSBR that may result in many different forms of SVs. Several SV classification systems have been generated depending on the mode of detection. Microscopic techniques (e.g., karyotyping) describe SVs at the resolution of chromosomal banding patterns (>3–10 Mb, depending on banding resolution), including large deletions, insertions, and rearrangements, but cannot detect smaller events. Higher resolution molecular techniques (e.g., array comparative genomic hybridization or aCGH, and single nucleotide polymorphism arrays or SNP arrays) describe SVs characterized by genomic imbalance, or copy number variations (CNVs), but are often unable to detect dosage imbalances below the resolution of probe density, and are blind to some rearrangements of chromosomes [72•, 73, 74]. As NGS techniques are developed offering both architecture characterization and high resolution, these traditional classification systems are being redefined [75]. A comprehensive classification system can describe SVs by (1) losses, (2) gains, or (3) rearrangements of segments of nucleotides compared to the reference genome (Fig. 1c). Rearrangements can be further subdivided into inversions, when the segment is in an opposite orientation, and translocations, when a genomic segment is displaced to a different part of the genome. Many traditional classification systems fit within this rubric. For example, deletions, copy number losses, microdeletions, and indel losses are all SV losses, while duplications, copy number gains, microduplications, and indel gains are all SV gains. DSBs that result in rearrangement SVs can also be accompanied by gains or losses and can be considered unbalanced. 
Some examples of unbalanced rearrangements include many forms of nonhomologous translocations as well as translocations caused by telomere erosion such as ring chromosomes, intrachromosomal ligation of telomeres, or Robertsonian translocations, the ligation of beta-satellite sequences after loss of the majority of the short arm in acrocentric chromosomes (i.e., human chromosomes 13–15, 21, and 22). More complex SVs may be characterized by all three properties; for example, an insertion can arise from the duplication of a segment of one chromosome (SV gain) that is inserted into DNA sequence of a separate chromosome (SV rearrangement) replacing the original DNA sequence (SV loss). As the detection of SVs by NGS now offers nucleotide resolution, more definitive classification systems are required to describe accurately the losses, gains, and rearrangements that result from SVs.
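The three-property rubric described above lends itself to a small data structure. The following Python sketch is one possible encoding, offered only as an illustration: the class name `SVEvent`, its fields, and the base-pair values in the example are hypothetical, while the mapping of traditional terms to losses and gains follows the examples given in the text.

```python
from dataclasses import dataclass
from typing import Set

# Traditional terms mapped onto the loss/gain rubric, as listed in the text.
TRADITIONAL_TERMS = {
    "deletion": "loss",
    "copy number loss": "loss",
    "microdeletion": "loss",
    "indel loss": "loss",
    "duplication": "gain",
    "copy number gain": "gain",
    "microduplication": "gain",
    "indel gain": "gain",
}


@dataclass(frozen=True)
class SVEvent:
    """A structural variant described relative to the reference genome."""
    lost_bp: int = 0            # reference nucleotides missing
    gained_bp: int = 0          # extra nucleotides present
    inverted: bool = False      # segment in the opposite orientation
    translocated: bool = False  # segment moved elsewhere in the genome

    def properties(self) -> Set[str]:
        """Return which of the three properties apply to this event."""
        props = set()
        if self.lost_bp > 0:
            props.add("loss")
        if self.gained_bp > 0:
            props.add("gain")
        if self.inverted or self.translocated:
            props.add("rearrangement")
        return props

    def is_unbalanced(self) -> bool:
        """A rearrangement accompanied by a gain or a loss."""
        props = self.properties()
        return "rearrangement" in props and bool(props & {"loss", "gain"})


# The insertion example from the text: a duplicated segment inserted into
# another chromosome, replacing the sequence that was there (sizes invented).
insertion = SVEvent(lost_bp=200, gained_bp=500, translocated=True)
print(sorted(insertion.properties()), insertion.is_unbalanced())
```

Describing each event by all three properties at once, rather than by a single label, is what allows a complex SV such as this insertion to be recorded without forcing it into one traditional category.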

Gains, losses, and rearrangements can arise due to errors in any one of the DSBR mechanisms; however, certain mechanisms are more prone to certain types of errors than others. NHEJ is often considered an error-prone mechanism because no template is used in DSBR. Instead, any DSB end protected by Ku heterodimers can be used for ligation in c-NHEJ. Even in the presence of a single DSB, involving only two Ku-protected ends, nucleotides may be lost either through the initial break event or during end processing by ARTEMIS. In the presence of multiple DSBs, c-NHEJ will ligate Ku-protected ends regardless of whether they originated from the same break or from different chromosomes, resulting in dramatic changes to chromosomal architecture, including rearrangements [22, 76]. Although microhomology is used in a-NHEJ, this mechanism typically results in SVs. Losses will almost always be generated as nucleotides are excised 3′ to the microhomology site. In addition, microhomology sites can bind promiscuously; while this is advantageous in cases of simple DSBs, in the presence of multiple DSBs a-NHEJ can result in many forms of rearrangements [40, 77]. On the other end of the spectrum, HR is considered to be a high fidelity mechanism, because homologous sequences serve as a template for repair. This is often true when sister chromatids, which are faithful duplications of each other, serve as the homologous template. When homologous chromosomes are used as a template in HR, differences (e.g., single nucleotide variants, or SNVs) between the invading strand and template will result in mismatches, which are rectified by DNA repair mechanisms. This results in a loss of heterozygosity between the homologous chromosomes and is referred to as GC. The effects are more dramatic when templates are used that share high sequence homology, but are non-allelic. 
This non-allelic homologous recombination (NAHR) can occur on templates that are within the same chromosome, in sister chromatids, in homologous chromosomes, or in entirely different chromosomes. NAHR can result in complex SVs with unbalanced rearrangements resulting in losses or gains of entire chromosomal arms. NAHR is likely to occur if HR is employed during G0 and G1 phases of the cell cycle, when sister chromatids are not available [26, 78]. The rate of DSBs and the choice of repair mechanism are also influenced by the developmental state and cell type.
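A simple case of NAHR, an unequal crossover between two directly oriented repeats, can be sketched to show how a single misaligned exchange yields a reciprocal loss and gain. The following Python model is a deliberately reduced illustration: the function name `nahr_products` and the segment labels are hypothetical, the chromosome is represented as an ordered list of segments, and the hybrid repeat formed at the junction keeps the label of one parent repeat for readability.

```python
def nahr_products(chrom, upstream_repeat, downstream_repeat):
    """Model an unequal crossover between two direct repeats.

    chrom is an ordered list of segment labels present on both aligned
    copies (sister chromatids or homologs). The upstream repeat on one
    copy misaligns with the downstream repeat on the other copy, and a
    crossover occurs within the paired repeats.
    Returns (deletion_product, duplication_product).
    """
    i = chrom.index(upstream_repeat)
    j = chrom.index(downstream_repeat)
    if i >= j:
        raise ValueError("upstream_repeat must precede downstream_repeat")
    # Product 1: read through the upstream repeat, then continue after the
    # downstream repeat; everything between the repeats is lost.
    deletion = chrom[:i + 1] + chrom[j + 1:]
    # Product 2 (reciprocal): read through the downstream repeat, then
    # continue after the upstream repeat; the interval is present twice.
    duplication = chrom[:j + 1] + chrom[i + 1:]
    return deletion, duplication


# Hypothetical locus: unique segments A, B, C flanked by two paralogous
# segmental duplications SD1 and SD2 in direct orientation.
locus = ["A", "SD1", "B", "SD2", "C"]
deletion, duplication = nahr_products(locus, "SD1", "SD2")
print(deletion)
print(duplication)
```

In this model segment B is lost from one product and present twice in the other. Repeats in inverted orientation or on different chromosomes would instead be expected to yield inversions or translocations, which this sketch does not attempt to model.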

DSBs can occur at any point during development, from pre-conception to adulthood. During prophase I of meiosis, DSBs are repaired by HR. In mammals, as the cell begins rapidly dividing, NHEJ becomes the predominant DSBR mechanism throughout the cell cycle [21, 23]. After mitosis, the cell enters G1 and NHEJ is the dominant DSBR mechanism [23]. HR in G1 is likely suppressed because there are no sister chromatids available, meaning any HR would result in either GC of homologous chromosomes or SVs resulting from NAHR. As cells progress towards S/G2 phase, and sister chromatids become available, HR is upregulated. CtIP modification, by ATM and BRCA1, may play a central role in this transition from NHEJ to HR in the cell cycle [79, 80]. The DSBR pathway is also regulated in a cell type-dependent manner; for example, lymphocytic stem cells create genetic diversity in immunoglobulins and T cell receptors by self-induced DSBs that are repaired by NHEJ (see below) [81]. In later stages, as cells age, SVs may begin to accumulate and may become oncogenic [82]. Given the frequency of DSBs and the errors that can occur, DSBR mechanisms are tightly controlled pathways; however, in certain processes within eukaryotic cells, DSBs are actually self-induced.

Random, Recurrent, and Programmed DSB

For the most part, DSBs are caused by random events at arbitrary locations. However, there is evidence that some areas of the genome are more susceptible to DSBs, areas known as recombination hotspots. Eukaryotic cells are also capable of self-inducing DSBs to generate genetic diversity.

Recombination Hotspots

Recombination hotspots are regions of elevated occurrences of SVs, which include fragile sites, segmental duplications (SDs), and transposable elements (TEs). Fragile sites are areas of the chromosome that are prone to breaks, gaps, and constrictions in metaphase chromosomes. These sites are thought to be prone to DSBs due to replication stresses such as fork slippage, caused by repetitive sequences, or fork stalling, caused by DNA hairpin formation. SDs, also known as low copy repeats (LCR), are areas of >10 kb that share high sequence homology. Regions enriched for SDs are associated with high occurrences of SVs due in part to their shared homology, which can serve as templates for NAHR, but are also known to undergo high rates of NHEJ. Interestingly, there has been a rapid expansion of SDs in the evolution of primates, suggesting duplications of duplications, despite their role in inducing SVs [83]. TEs are sequences that have likely originated from viral integration and are capable of moving around the genome. TEs can be classified as class I “copy-and-paste” elements, requiring an RNA intermediate, and class II “cut-and-paste” elements, requiring no RNA intermediate. Due to their unique properties, TEs are involved in both NHEJ and NAHR. While being excised or reintegrating following retrotransposition, TEs can induce DSBs that follow NHEJ repair; however, due to their homology they can also serve as a template for NAHR. Although both SD and TE loci are often associated with disease, theories suggest that these hotspots have promoted rapid evolution in the human genome [84, 85•, 86, 87•].

Programmed Rearrangements

Eukaryotic cells have evolved to take advantage of what in other circumstances would be considered an error in DSBR. In germline cells, SVs are produced, which may help create population diversity within a species. In prophase I of meiosis, DSBs are induced by SPO11 and repaired through HR. HR may result in crossing over between homologous chromosomes and unique segregation of genes not possible from the chromosomal structure of the parental genomes [88]. Only a small fraction of the SPO11-induced breaks will result in crossovers between homologous chromosomes, while most will result in pairing with sister chromatids and/or non-crossover events. Meiotic crossover hotspots are also evident, but it is unclear what controls SPO11 susceptibility and crossover regulation [89]. Meiotic crossovers can also be repaired improperly, specifically as a result of NAHR, creating de novo SVs that can range from inert to embryonic lethal. During development, more restricted induced DSBs occur in immunological cells, creating antibody diversity that is essential for protecting against viral and bacterial infection. During development of lymphocytes, V(D)J (for variable, diversity, joining) recombination and class-switch DNA recombination (CSR) recombine regions of the immunoglobulin and T cell receptor loci to create diversity. In V(D)J recombination, DSBs are induced in the variable regions of both immunoglobulin heavy and light chains as well as the T cell receptor by RAG1 and RAG2 at conserved AT-rich heptamer sequences, and these breaks are subsequently repaired through NHEJ [90•]. In CSR, DSBs are induced in the constant region of the immunoglobulin locus by AID and the break is subsequently repaired by either c-NHEJ or a-NHEJ [91••]. 
While V(D)J and CSR are able to create an immunoglobulin repertoire in the tens of billions, this induced recombination likely arose out of viral elements similar to TE and is not without its consequences as a possible contributor to lymphoid malignancy [81]. Recent reports suggest that induced DSBs may not be restricted to meiotic and lymphocyte cells, but may also contribute to the diversity of neurons [92, 93].

Complex SVs

Next-generation sequencing of the genomic architecture of cancers and human neurodevelopmental disorders has revealed complex structural variants (also known as complex genomic rearrangements or CGRs) involving a combination of two or more deletions, duplications, insertions, or inversions over one or more chromosomes. In one study, most of the complex events detected involved one or more cryptic inversions at the rearrangement breakpoints [94•]. The most striking example of complex SVs is a phenomenon known as chromothripsis (Greek, thripsis: shattering), whereby chromosomes are fragmented into tens to hundreds of segments and rearranged with or without accompanying alterations in copy number. First described in chronic lymphocytic leukemia (CLL), it is now believed that 2–3 % of all cancers and up to 25 % of bone cancers harbor chromothripsis events [95–98]. Germline chromothripsis has also been reported in cases of developmental delay or cognitive defects [78, 94•, 99]. In some cancers, this extensive genomic rearrangement likely occurs as a single event of breakage and repair rather than as the aggregation of numerous structural insults, evidenced by copy neutral states, regional clustering, and interspersed loss of heterozygous segments in the derivative chromosomes [100••]. However, the trauma and repair of chromosomal shattering may occur through multiple mechanisms, and this simultaneity does not necessarily characterize complex SV cases that are accompanied by extensive CNV [78]. While the mechanisms underlying the shattering and religation of catastrophic chromosomal events remain an area of active investigation, several theories have been proposed. Chromosomal shattering may occur by ionizing radiation or free radical oxidation to condensed mitotic figures, by breakage of dicentric chromosomes during anaphase, by recovery of a cell after an aborted apoptosis, or by formation of micronuclei containing anaphase lagging chromosomes [101, 102•, 103, 104]. 
Following the genomic breakage, data from Chiang et al. [94•] and others suggest that events studied to date are repaired with little to no homologous sequence at the breakpoints, indicating a predominance of error-prone NHEJ repair [94•, 98–100••]. Nonetheless, templated homology- or microhomology-based repair (e.g., MMBIR and FoSTeS) likely plays a role in a subset of chromosomal shattering events, particularly those that are accompanied by copy number changes [78].

Summary

Although SVs have been visible under the microscope for several decades, only recently have some of the complexities that underlie these changes to the genomic structure become apparent. It has become evident that many SVs stem from errors in the repair of DSBs, but the molecular mechanism of repair and the cause of these errors are still being deciphered. Even more confounding, some SVs are induced by the cell itself to produce genetic diversity within the organism as well as the population. The advent of next-generation sequencing technologies has offered new perspectives on the normal and pathological architecture of the genome. These tools have allowed us to recognize that SVs are not only common, but also make up a significant proportion of human variation.