Background

The identification of structural variants (SVs) in mammalian genomes [14] has important implications for our understanding of genetic diversity, has elucidated the concept of genomic disorders [5, 6] and has improved the analysis of genetic association in common and rare diseases [712], cancer development [13] and genomic evolution [14, 15]. However, the accurate identification of SVs in mammalian genomes remains challenging.

Next generation sequencing provides a novel approach for identifying structural variants [16] and exploits read-pair information [17, 18], split reads [19, 20], read depth [21] and sequence assembly [22] to localize SVs. Typically, variation in the expected number of reads mapping to the reference sequence is used to identify copy number variants while deviations from the expected distance between reads, and the orientation of reads, is used to infer the presence and type of structural variant at a locus. These methods presuppose that sequencing reads form characteristic patterns for different types of structural variants [23]. For example, when the two sequenced ends of a fragment map back to the reference genome in the correct orientation, but at a distance that is significantly larger than the size of the fragment itself (as inferred from the library insert size distribution), this indicates a deletion.

Algorithms that use whole-genome sequence reads make assumptions about the paired-end mapping (PEM) patterns they will encounter, even though we know that the molecular architecture of structural variants can be remarkably complex [24, 25]. For example, deletion and inversion events that appear simple may contain additional sequence at breakpoints and different types of structural variants sometimes occur together, so that, for example, an inversion immediately abuts a deletion [26]. However, current automated methods to identify SVs are unable to differentiate basic patterns (for example, a simple inversion) from more complex ones (for example, an inversion right next to a deletion), resulting in some SVs being incorrectly classified while others are missed altogether.

Past studies have described several forms of complex structural variants, ranging from multiple rearrangements at large genomic loci [27, 28], to deletions, inversions, insertions and duplications that fall in close proximity [29]. More recently, a subtle form of complex SVs has been characterized by micro-insertions or micro-deletions at the breakpoints of larger SVs [30]. In our present study, we describe complex SVs as two or more structural variants co-occurring at the same locus, without intervening DNA of normal structure between the variants (SVs are directly adjacent to each other) and without distinction by SV size (complex SVs can be two or more large SVs right next to each other or a small SV right next to a larger SV).

Here we combine visual inspection of PEM data from 17 mouse genomes [31] with experimental validation to resolve the molecular architecture of SVs and to guide genome-wide computational analysis [32]. We provide a comprehensive catalog of 21 PEM patterns derived from simple and complex SVs, and show how these patterns may provide insights into the fine-scale molecular architecture of SV formation.

Results

Catalog of paired-end mapping patterns

We started by generating a set of validated PEM patterns that we could use to guide genome-wide computational analysis. To do this, we visually examined short-read PEM patterns and manually called SVs from 100 Mbp in 17 inbred strains of mice [31, 32] (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6NJ, CBA/J, DBA/2J, LP/J, 129S5SvEvBrd, 129P2/OlaHsd, 129S1/SvImJ, NOD/ShiLtJ, NZO/HlLtJ, CAST/EiJ, PWK/PhJ, WSB/EiJ and SPRET/EiJ) that included the whole of mouse chromosome 19 (61 Mbp in size), and a random set of other chromosomal regions. We provide an overview of the procedure to catalog PEM patterns in Figure 1a, b.

Figure 1
figure 1

An overview of the procedure to characterize structural variants. A flow diagram of the different steps we used to characterize structural variants. (a) We first inspected 100 Mbp of next generation sequencing (NGS) data in 17 genomes. We examined chromosome 19 in its entirety and a random set of other chromosomal regions. (b) We identified 21 PEM patterns across 1,494 SV sites. We show two examples of PEM patterns, H1 and H2, as visualized using the LookSeq tool [38]. The H1 deletion is on chromosome 1 and has strain distribution pattern 11110101 (1 means presence and 0 absence of the deletion) in the following strain order: A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J. The H2 deletion is on chromosome 8 and is present in seven strains. (c) We randomly selected 662 SV sites for PCR-validation to investigate all PEM patterns and show the results for the two deletion SVs. (d) We randomly selected 261 SV sites for analysis of breakpoint sequence features using Sanger-based sequencing technology. We show sequencing data of a simple deletion of type H1 on chromosome 10 (20,467,348-20,467,894). Sequence analysis confirmed the deletion of 547 bp but also revealed an insertion of 34 bp.

Based on read depth and anomalous PEM, we identified 21 patterns, as described in Table 1 and Additional file 1. We unambiguously classified 11 PEM patterns, referred to as 'H' patterns, for high confidence. While some of the H patterns are typical and have already been described [23], others (H3, H5, H9 and H11) are novel. Figure 2a shows the novel PEM pattern H5, an inversion directly flanked by two deletions. Note that depending on the size of the inversion, the H5 pattern of paired-end reads will differ: for instance, suppose the length of the inversion is small, H5 reads will span both deletions and inversions, giving a PEM pattern of a typical deletion; suppose now the inversion is larger, H5 reads will individually span each deletion, giving a PEM pattern of an inversion.

Table 1 The 21 PEM patterns with their corresponding SV type
Figure 2
figure 2

Novel PEM patterns. (a) PEM pattern of a Del(s)+Inv (H5). We draw paired-end reads (black arrows) and how they map to the reference (Ref.) genome C57BL/6J (dashed grey lines). Blue boxes represent deletions and orange boxes inversions. Green arrows represent primers used for PCR amplification and sequencing reactions. (b) PEM pattern of an Inv+Ins (Q8), with PCR data across eight classical strains (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J). The pink box represents de novo sequence insertion. The amplicon size for BALB/cJ, C3H/HeJ, CBA/J and DBA/2J is about 500 bp larger than the other strains, indicative of the insertion.

Interpretation of the remaining ten PEM patterns was ambiguous. We refer to these as type Q ('Questionable') patterns (Q1 to Q10; Table 1; Additional file 1). With the exception of Q4 (large deletion), the remaining Q patterns have not been described before. Two patterns appeared false positives (Q6 and Q10). Q1 and Q7 were due to variable number tandem repeats. Q5 and Q9 were difficult to interpret: read-pair information suggested a structural variant while read depth did not. Q2, Q3 and Q8 had partial patterns compared to typical SV patterns. For example, Q2 has a read depth of zero flanked by regions of normal read depth but it does not have paired-end reads spanning the length of the variant as does a typical deletion.

We found that partial PEM patterns were caused by the presence of a de novo insertion right next to a deletion or inversion. Figure 2b shows an example of an inversion flanked by a de novo insertion. A paired-end read spanning the first breakpoint of the inversion is mapped as expected to the reference genome. However, one end of the fragment spanning the second breakpoint of the inversion does not map to the reference genome (because it lies within the inserted sequence), creating a signature of an 'orphaned' read mate.

From the 100 Mbp we visually inspected, we identified a total of 1,494 SVs that matched the 21 PEM patterns (Additional file 1). Because visual identification of H6 and H7 patterns was more difficult than for the other variants, we excluded them from our analysis of chromosome 19 and identified 872 deletions (631 type H and 241 type Q) bigger than 100 bp, 15 inversions (2 type H and 13 type Q) and 3 copy number gains (all type H) (Additional file 2). In addition to the 890 SVs identified on chromosome 19, we found 604 on the other chromosomes. We refer to these lists on chromosome 19 and other chromosomal regions as our gold-standard list of PEM patterns.

Since we examined the whole of chromosome 19 in eight strains, we looked at the distribution of SVs along the chromosome in the context of regional features. To do this, we counted the number of SVs overlapping protein-coding genes, coding exons and repeat regions on chromosome 19 using Ensembl build 65, and we compared this to a null distribution of the expected number of overlaps, obtained by performing a permutation analysis. Across all strains, we found a non-random distribution of SVs along the chromosome (Additional file 3) with enrichment (P < 0.01, fold change 2.2) in repeat regions and depletion (P < 0.01, fold change 0.25) in coding regions. We found only two SVs on chromosome 19 that affect one, or more, coding exons of genes involved either in immunity or olfaction (Additional file 4).

Architecture of SVs using PEM inspection

Next we tested the molecular architecture of SVs as inferred by our visual inspection of PEM patterns (Figure 1c). To do this, we applied a primer design strategy depending on type and length of the SV (Additional file 5) and confirmed the underlying molecular structure of all 21 PEM patterns using PCR- and Sanger-based sequencing across 8 (A/J, AKR/J, BALB/cJ, C3H/HeJ, C57BL/6J, CBA/J, DBA/2J and LP/J) of the 17 Mouse Genomes Project strains [31]. These eight strains are the progenitors of the HS (heterogeneous stock) [33], an outbred population we have used to achieve genome-wide high-resolution mapping of multiple phenotypes [34].

We designed 742 pairs of primers (Additional file 6) and successfully amplified 662 SV sites (Additional file 7). It should be noted that we excluded failed designs (due to the presence of SNP(s) in the primer sequences) and designed 80 additional primer pairs to amplify sites when one primer pair yielded no information or only part of the answer - for instance, for a deletion adjacent to an insertion predicted by visual inspection of the PEM. With the exception of insertions (excluded from our chromosome 19 data set), SV sites we analyzed were representative of the overall distribution of PEM categories (Additional file 8).

We defined simple SVs as those whose structural interpretation is straightforward and consists of one SV type: insertions, deletions and inversions (Figure 3a). We also identify another type of insertion, a copy number gain, as consisting of non-repetitive DNA that is present in multiple copies relative to the reference genome. When this sequence occurs immediately adjacent to its original, it is annotated as tandem duplication; when it is small and close to another copy, it is annotated as a linked gain.

Figure 3
figure 3

Architecture of structural variants. (a) Simple SVs: deletion (Del), insertion (Ins), inversion (Inv), tandem duplication (tandem Dup) and other types of copy number gains. Linked gain is a small copy number gain at close proximity to its copy. Inverted linked gain (not drawn) is similar to a linked gain but the copy is inverted. Del+Nml+Del is two deletions separated by a normal copy of small size. (b) Complex SVs: deletion co-occurring with insertion (Del+Ins), inversion with flanking deletions (Del(s)+Inv), inversion with insertion (Inv+Ins), deletion within a copy number gain (Del in gain) and inversion within a copy number gain (Inv in gain).

In contrast to previous SV studies that use the number of breakpoints that fall in close proximity, our definition of complex SVs is based on the mixture of SV types (of small or large size) that directly abut each other, with no intervening DNA, since these might be the progeny of a single process (marked as Del+Ins, Del(s)+Inv and Inv+Ins in Figure 3b). We also separately identify an SV within a copy number gain (termed 'Del in gain' and 'Inv in gain' in Figure 3b) since the probability of coincidence is less than one event per genome.

Our categorization of predicted SV structures, based on manual inspection of PEM patterns, resulted in the highly confident identification of a structural variant for 18 of the 21 patterns that we examined by PCR: 12 were indicative of a simple SV and 6 of a complex SV (Table 1). Two patterns did not represent structural variants (Q6 and Q10), but were due to the presence of a retrotransposed pseudogene, which caused false SV patterns. SVs of type Q7 (55 cases) were due to variable number tandem repeats, for which we could not predict the number of repeats or molecular structure (Additional file 8).

We estimated the relative proportions of simple and complex SVs by manual inspection of PEM patterns on chromosome 19. Assuming an equal number of deletions and insertions on chromosome 19, then about 88% of SVs are composed of one SV, 2.5% of two adjacent SVs at the same locus and 9.5% are variable number tandem repeats (Additional file 8). Note that we have not recorded SVs on chromosome 19 where three (or possibly more) different types of SVs co-occurred (for example, a deletion right next to an inversion and insertion), although about a dozen rearrangements had three different types of SVs based on their PEM patterns. Consequently, our estimate of the number of complex SVs on chromosome 19 based on PEM inspection is likely to be conservative.

Fine-scale architecture of simple SVs

To gain insights into the fine scale architecture of simple SVs as inferred by our visual inspection of PEM patterns, we randomly selected 261 simple SV sites and analyzed their breakpoints (Figure 1d; Additional file 8). Using the rat as an outgroup species, we inferred SV ancestry (as described in [32]), and classified SVs into two groups (ancestral insertion or deletion). We found 105 ancestral insertions and 142 ancestral deletions; the remaining 14 were either inversions (n = 6) or tandem duplications (n = 8). We sequenced a total of 4,176 breakpoints at 261 selected SV regions in 8 strains using PCR-based Sanger sequencing (Additional file 9).

We found additional complexities at breakpoints of 62 SVs (24%): small additional (micro-insertions) and small missing (micro-deletions) sequences (Table 2). Figure 1d shows an example of a simple H1 deletion; sequence analysis at the nucleotide level confirmed the deletion of 547 bp but also revealed an insertion of 34 bp.

Table 2 Sanger-based sequence analysis at 4,176 breakpoints

We determined the extent and content of micro-insertions and micro-deletions and identified three patterns of SV breakpoints (classified in Table 2). The first pattern is characterized by micro-deletions at SV breakpoints. At 15% of ancestral insertions there were missing nucleotides at the breakpoints, ranging from 1 to 289 bp.

The second pattern includes SVs that have sequence inserted at their breakpoints; 27% of ancestral deletions showed a micro-insertion, with size ranging from 1 to 107 bp. We report the origin of the sequence involved in micro-insertions in Additional file 9. There were three cases: (i) intra- or (ii) inter-chromosomal copy number gain of small size, or (iii) insertion of retrotransposons.

The third pattern of SV breakpoints is characterized by simple SVs with both occurrence of micro-deletion and micro-insertion. One third of inversions had this pattern at their breakpoint. Size of SVs was not correlated to one particular pattern, nor to micro-deletion/micro-insertion length and type.

Our analysis of breakpoint sequence features in multiple strains also allowed us to look for a relationship between sequence variants and SV formation. In particular, we addressed the question as to whether sequence variants at breakpoints were associated with SV formation.

In all cases, the presence of SNPs in the micro-homology region (short length of identical sequence at an ancestral deletion's start and end points) was correlated with the presence of the SV (Figure 4). The SNP elongates the micro-homology, or, alternatively, the micro-homology reflects a hyper-mutable state associated with break-induced replication around the SV [35]. However, this phenomenon is rare: we only observed five (4.5%) cases amongst our manually curated ancestral deletions (Additional file 9) where a SNP and SV formation co-segregate. We found a similar relationship between a SNP formed at the target site duplication and the presence of an ancestral insertion. Fifteen ancestral insertions (16%) had SNPs or short indels within their target site duplication, coincident with the insertion (Additional file 9).

Figure 4
figure 4

Relationship between SNP and SV formation. Two SNPs lying on the 6 bp micro-homology of an ancestral deletion of 64 bp (chr12:27,040,459-27,040,522) correlated with the presence of the SV. Sequencing traces are shown for a test strain (A/J) and the reference strain (C57BL/6J). Note that all other test strain traces are identical to the one shown here. The black asterisks indicate the micro-homology of 6 bp (GAACTA). The presence of two SNPs (C->G and T->A, shown using red asterisks) in all test strains (here only shown in A/J) is associated with the presence of the structural variant.

Discussion

Our findings are important in two ways. First, we show that an integrative approach using experimental analyses to train computational SV calling is essential for the accurate characterization of SV architecture. Second, we find a considerable complexity in SV formation; about a quarter of SVs in the mouse are composed of a complex mixture of deletion, insertion, inversion and copy number gain.

In contrast to studies that start by identifying SVs using automated genome-wide methods, followed by experimental validation, we started by experimentally determining a set of SVs and then applied this information to interpret whole-genome automated SV detection [32]. Laboratory-based efforts proved essential for two main reasons. First, they allowed the correct interpretation of the PEM patterns. Without knowing how to interpret the underlying molecular structure of each PEM, some patterns would be missed or classified incorrectly by computational methods alone. Second, our laboratory efforts allowed the recognition of a diversity of PEM patterns. Otherwise we would not be able to distinguish between simple and complex SVs.

Finer-scale breakpoint sequence analysis reveals that 24% of simple SVs have smaller rearrangements at the nucleotide level (micro-insertions or micro-deletions at the breakpoint of a larger SV). This raises questions about the likely mechanisms of SV formation.

We know that retrotransposition is the commonest mechanism of SV formation in the mouse [32]. We also know that retrotransposons (LINEs, SINEs and long terminal repeats) are typically characterized by flanking target site duplications and a poly(A) tail or poly(T) head. However, we observed that 15% of retrotransposon SVs do not have target site duplications and truncated or absent poly(A) tails or poly(T) heads (Additional file 9). Moran and colleagues [36] observed a similar phenomenon in the human genome and suggested that retrotransposons, such as LINE-1 elements, integrate into DNA lesions, resulting in retrotransposon-mediated DNA repair. We suggest that about 15% of retrotransposon SVs in the mouse genome formed through a similar mechanism involving DNA repair.

It is reasonable to assume that the complexities (micro-insertions and micro-deletions) we see at the breakpoints of ancestral deletions, inversions and gains (we call these 'complex' non-retrotransposon SVs) (Table 2) will correlate with a complex mechanism of formation. A DNA replication fork stalling and template switching/microhomology-mediated break-induced replication (FoSTeS/MMBIR) mechanism has been proposed to generate such complex SVs in the human genome [37]. In addition, about half of our complex non-retrotransposon SVs have microhomology (short sequence of identical bases) ranging from 3 to 25 bp (Additional file 9), compatible with a microhomology-mediated break-induced replication process. It could be that the complex non-retrotransposon SVs are also the progeny of mutational processes during DNA replication.

Interestingly, our estimate that 24% of SVs have micro-structures at their breakpoint is the same as that reported by Eichler and colleagues [30] in a study of human structural variation. Another sequencing-based study of SVs in two mouse strains (DBA/2J and C57BL/6J) examined 3,316 breakpoints and reported that 16% of non-transposon structural variants are complex, as defined by multiple breakpoints mapped to within 1 kbp of each other [29]. However, we were not able to directly compare these results to ours since we have not used the same classification criteria (we used a classification based on SVs being right next to each other, whereas Hall and colleagues' [29] was based on SVs being at close proximity).

Ideally, sequencing longer reads would typically be required to resolve the complex architecture of structural variants we report in this study, something that goes beyond the current generation sequencing platforms. Our findings offer an intermediate solution between next generation sequencing analysis and complete de novo assembly of genomes.

Materials and methods

Visual identification of PEM patterns

We visually inspected short-read sequencing data using LookSeq [38] and manually detected PEM patterns across mouse chromosome 19 in its entirety and a random set of other chromosomal regions, accounting for 100 Mbp of total genomic regions. We analyzed molecular architecture of these PEM patterns at nucleotide-level resolution using PCR and Sanger-based sequencing.

Experimental validation of SV architecture

Primers were designed using Primer3 [39] and purchased from MWG (Ebersberg, Germany). Primer design strategy was dependent on the type and size of the structural variant (Additional file 5). Three independent PCR reactions were carried out with Hotstar Taq obtained from Qiagen (Hilden, Germany). Reactions were performed as previously described [40]. LongRange PCR kit (Qiagen) was used for genomic regions > 2 kbp. PCR gel images were then assessed for quality of primer design and performance of PCR reaction. Representative PCR examples are given in Additional file 1. We provide interested readers with PCR results for each SV site (Additional file 7). Complex SV sites were subject to several rounds of primer design and PCR. PCR products were then purified in a 96-well Millipore purification plate, resuspended in 30 μl of H2O and sequenced as previously described [40]. All sequencing reactions were run out on an ABI3700 sequencer and assembled by using PHRED/PHRAP [41].

Assembly of Sanger-based sequencing data

Consed was used for visualization and editing of the assembly [42]. Strains with and without the SV were aligned into one contig. Breakpoint analysis was mostly based on visual inspection of the alignment and BLAT search. The first breakpoint was identified when the strains with and without the SV stopped aligning and the second breakpoint when they started aligning again. Micro-insertion at the breakpoint was defined as any additional sequence relative to the reference genome (MGSCv37) in the strains with a deletion variant. Micro-deletion at the breakpoint was defined as any missing sequence relative to the reference genome (MGSCv37) in the strains with insertion variant. We also recorded the longest stretch of sequence identity at both breakpoints, which we defined as sequence micro-homology. When micro-homology occurred, we recorded the SV so that start and end coordinates were the smallest. For blunt ended SV, one set of start and end coordinates was recorded. BLAT was used to get the exact start and end coordinates of the SV.

Genome-wide detection of structural variants

Genome-wide structural variants were detected using four methods: split-read mapping (Pindel) [20], mate-pair analysis (BreakDancer) [18], single-end cluster analysis (SECluster and RetroSeq, unpublished), and read-depth (CND) [21]. Details of the complete pipeline, SVMerge, is described elsewhere [43]. We used in-house Perl scripts to detect genome-wide complex structural variants [32].

Data

Data sets described in this study are available under study accession number 'estd185' from the Database of Genomic Variants Archive (DGVa) [44]. Our previous genome-wide data of structural variants [32] are also available from DGVa under accession number 'estd118'.