
1 Introduction

In recent years, studies have shown that the human genome harbors a diverse array of genetic variation. Studying these variations will not only help reveal the genetic mechanisms underlying many complex diseases and individual differences, but will also accelerate the progress of personalized medicine. Genetic variation is broadly classified into two kinds: single nucleotide polymorphisms (SNPs) and structural variations (SVs). Since the last quarter of the twentieth century, SNPs have been regarded as the most common genetic variation in the human genome; they have been widely studied [1] and were traditionally identified by PCR-based methods. On the other hand, several other types of genetic variation, including insertions, deletions and copy number variations (CNVs), are also widespread in human genomes [2], and they are significant in several areas of biology, such as studies of obesity [3], cancer genomes [4–6], and molecular evolution [7, 8]. In particular, the discovery of CNVs in the human genome has dramatically shifted attention toward structural variations and their association with disease phenotypes. A recent review revealed that SVs, especially CNVs, span more than 1000 genes; moreover, CNVs cover a far larger proportion of the genome than SNPs, up to ~12% of the human reference genome [9, 10]. This suggests that CNVs contribute substantially to genetic diversity and evolution between human populations. CNVs are just one class of SVs, defined by the size of the inserted or deleted segment (>1 kb). The recognized types of SVs (shown in Fig. 1) include indels, inversions, copy number variations (CNVs) and translocations. The term "indel" refers to insertions or deletions smaller than 1 kb; when the affected segment is larger than 1 kb, the variation is referred to as a copy number variation, that is, a duplication or deletion of a large genomic segment.

Fig. 1.

Several types of structural variations [17].

Until recently, the methods used to detect SVs have mainly included microarray-based technology [11], fluorescent hybridization technology [12], multiplex PCR technology [13] and sequencing-based technology [14]. The earliest methods were based on microarray platforms such as oligonucleotide-based array comparative genomic hybridization (array-CGH) [15] and bacterial artificial chromosome (BAC) arrays [16]. Although computational approaches based on array data were successfully used to identify CNVs and other types of SVs such as translocations, they also have some limitations. For example, array-CGH cannot detect chromosomal translocations or inversions owing to its limited dynamic range; furthermore, breakpoint resolution is limited by the density of the array. Sequencing-based methods emerged later, beginning with Sanger sequencing, which was used to identify genomic variants in the human genome; its objectionable features, however, are that it is expensive and time-consuming. Compared with conventional microarray-based methods and Sanger sequencing, NGS has overriding strengths in cost-effectiveness and throughput, which has driven the development of NGS-based technologies for detecting genetic variations in the human genome.

Many NGS-based detection algorithms for SVs have sprung up, enabling extensive SV detection. They basically fall into the following mainstream categories: PEM-based methods, read-depth methods, split-read methods and sequence-assembly methods. The split-read method is generally combined with the PEM-based method in various SV detection tools. In general, identifying SVs involves these steps: first, aligning the short sequencing reads to a given reference genome; then, finding regions of interest that differ from the reference and are therefore likely to harbor a potential SV; finally, verifying these variations by some validation strategy.

In this survey, we describe current algorithms for detecting SVs using next-generation sequencing, which we roughly classify into three types. We then discuss the strengths and weaknesses of these methods, and finally provide an outlook on future research directions and summarize this article.

2 Algorithms for Structural Variations Detection

Paired-end reads and mate-pair reads are two distinct kinds of reads generated by sequencing technologies using two disparate strategies, each with a known insert distance. The difference between them is that the fragments of paired-end reads are shorter than those of mate pairs; the fragment length is constrained by the sequencing platform, so the two can be distinguished by their lengths. The first strategy, used for mate pairs, circularizes the DNA segments during library preparation; the resulting reads have a long insert size and are better suited for detecting large SVs. The other strategy sequences both ends of a DNA fragment whose size, the insert size, is approximately known; it achieves higher resolution when detecting small SVs. In this survey, both kinds are referred to uniformly as "pairs-read".

PEM-based methods identify SV breakpoints by aligning short paired-end reads to the reference genome and finding pairs that are 'discordant' with the reference, which probably indicate some class of SV. The paired-end reads come from a sequencing library containing many fragments of a known length. A 'discordant' paired-end read is one that diverges from the expected distance or the expected orientation. During alignment and examination, a mapping signature is produced that indicates the presence of an SV, so we discuss the mapping signatures first.
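The discordance test just described can be sketched as follows; the record layout, the 3 s.d. cutoff, and all names here are illustrative assumptions rather than any specific tool's implementation.

```python
# Sketch of discordant paired-end classification; the 3-s.d. cutoff and the
# record layout are illustrative assumptions, not taken from any specific tool.
from dataclasses import dataclass

@dataclass
class ReadPair:
    insert_size: int        # observed distance between the mapped mates
    same_orientation: bool  # True if mates map in the expected orientation

def is_discordant(pair, mean_insert, sd_insert, n_sd=3):
    """A pair is discordant if its insert size deviates from the library
    mean by more than n_sd standard deviations, or if its mates map in an
    unexpected orientation."""
    if not pair.same_orientation:
        return True
    return abs(pair.insert_size - mean_insert) > n_sd * sd_insert

# Example: library with mean insert size 400 bp, s.d. 30 bp.
pairs = [ReadPair(410, True),   # concordant
         ReadPair(700, True),   # spans a deletion: insert too large
         ReadPair(395, False)]  # inversion signature: wrong orientation
print([is_discordant(p, 400, 30) for p in pairs])  # [False, True, True]
```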

2.1 Signatures Based on PEM

The earliest two signatures were insertions and deletions (Fig. 2a and b). Figure 2 shows two types of read signatures: one comes from paired-end reads, and the other from split reads. Since most current methods fuse these two kinds of reads to detect SVs, for simplicity we categorize them together under PEM-based methods.

Fig. 2.

Interpretation of PEM signatures [18].

The term "ref" denotes the original genome; a ref genome is often used as a control. Similarly, "donor" denotes the sequenced genome under study. A pairs-read spanning either breakpoint of an inversion will map with an orientation opposite to the reference (Fig. 2c). Figure 2d shows the most obvious discordant paired-end read, which occurs within a chromosome. In another case, one read of a pair maps at the expected location while the other maps on a different chromosome (Fig. 2e). Note that the cases from Fig. 2e to Fig. 2i are more complicated than simple deletions and insertions.

Tandem duplication is an ordinary case of SV; the pairs-read link from the end of the duplicated segment back to its beginning (Fig. 2f). A linked case is one in which two reads that are distant on the ref genome lie very close together on the donor; in other words, compared with the ref, the orientation and order of the pairs-read remain unchanged, while the mapped distance between them is smaller than the distance on the ref genome (Fig. 2g). A situation somewhat akin to Fig. 2g but more complex is that a distant mobile element or segment is inserted into the donor genome, resulting in a linked insertion with the pairs-read closer together than before (Fig. 2h). Sometimes a segment longer than the insert size is embedded in the donor genome; a hanging-insertion signature is then formed, with one read left unmapped (Fig. 2i). Conversely, when a long segment present in the ref is absent from the donor, a hanging-deletion signature akin to Fig. 2i arises, with one read of the pair unmapped on the donor genome (Fig. 2j).

Split-read mapping also captures multiple types of SVs. For an insertion, the prefix and suffix of the split read map to neighboring locations, while the intermediate region is the inserted segment (Fig. 2l). In the case of a deletion, the prefix and suffix of the split read map on either side of the breakpoint, adjacent to each other (Fig. 2k). An interspersed duplication is the case in which a segment from another location of the donor genome is attached to one end of the split read (Fig. 2o). A similar case is the mobile-element insertion (Fig. 2m); contrary to Fig. 2m, in Fig. 2n the orientation of the mobile element is opposite to its original orientation in the reference genome. The tandem-duplication case is similar to the one described for Fig. 2f (Fig. 2p).
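As a toy illustration of the deletion signature in Fig. 2k, the following sketch tries to explain a read as a reference prefix plus a downstream suffix; exact string matching and the minimum anchor length are simplifying assumptions, far from a real aligner.

```python
# Toy split-read deletion detection on strings; exact matching and the
# minimum anchor length are simplifying assumptions for illustration.
def find_split_deletion(read, ref, min_anchor=4):
    """Try to explain `read` as a prefix + suffix of `ref` with a gap in
    between; return (ref_pos, deletion_len) or None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        prefix, suffix = read[:split], read[split:]
        p = ref.find(prefix)
        if p < 0:
            continue
        # The suffix must map downstream of where it would sit if concordant.
        s = ref.find(suffix, p + split)
        if s > p + split:  # a gap between the two halves => deletion
            return p, s - (p + split)
    return None

ref  = "ACGTACGGTTCAGGAATTCCGGA"
read = "ACGGGAATTC"  # "ACGG" + "GAATTC": the donor lacks "TTCAG" (5 bp)
print(find_split_deletion(read, ref))  # (4, 5)
```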

2.2 Methods Using PEM

Over the course of technological development, many PEM-based algorithms and tools have been proposed and designed for SV detection (Table 1). Two classes of strategies are used to detect SVs: the distribution-based method and the clustering-based method. The main steps of the clustering-based method are to label pairs as concordant or discordant, and then to call the underlying SVs using a clustering approach. Only if the orientation of a pairs-read matches the reference genome and its distance matches the expected distance is it labeled concordant; otherwise it is discordant. For example, Mateo et al. [19] used an SVM model to cluster the local patterns of mapped reads and then predicted the positions of SVs. Korbel et al. [20] and Tuzun et al. [21] first labeled the PEM signatures and then clustered the discordant pairs together; only if the size of a cluster exceeds a specified value is it identified as a potential SV.
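The clustering step described above can be sketched as follows, with a maximum within-cluster gap and a minimum support count as illustrative, assumed thresholds:

```python
# Sketch of the clustering step: discordant pairs whose left-mate positions
# fall within `max_gap` of each other are grouped, and a cluster is reported
# as a candidate SV only if it has at least `min_support` pairs. Both
# thresholds are illustrative assumptions.
def cluster_discordant(positions, max_gap=100, min_support=2):
    """positions: non-empty, sorted left-mate coordinates of discordant pairs.
    Returns (start, end, support) for each sufficiently supported cluster."""
    clusters, current = [], [positions[0]]
    for pos in positions[1:]:
        if pos - current[-1] <= max_gap:
            current.append(pos)
        else:
            clusters.append(current)
            current = [pos]
    clusters.append(current)
    # Only well-supported clusters are called as candidate SVs.
    return [(c[0], c[-1], len(c)) for c in clusters if len(c) >= min_support]

# Three pairs support one event near 10 kb; a lone pair at 50 kb is discarded.
print(cluster_discordant([10010, 10055, 10090, 50000]))  # [(10010, 10090, 3)]
```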

Table 1. Tools of PEM-based method [16]

These methods depend on two parameters: the number of standard deviations used to determine whether a pairs-read is discordant, and the minimum number of pairs-read required to define a cluster. These factors are interconnected and related to the coverage; in other words, the coverage is inversely related to the required number of pairs-read or the number of standard deviations.

One weakness of the clustering approach is that it ignores the case in which a pairs-read matches multiple mapping sites, so detecting signatures within repeat regions of the genome is tough work. However, repeat regions are strongly associated with duplicated reads, so various methods have been designed to address this issue. The adopted optimization is to select, for each pairs-read, a 'good' cluster with the maximum support.

Another deficiency is that the clustering method uses a fixed critical value for the number of standard deviations beyond which a PEM signature is considered discordant. If the discordance threshold on the mapped distance were changed from 2 s.d. to 1 s.d., pairs spanning the same breakpoint might no longer form any cluster. Lee et al. [22] solved this problem by proposing a distribution-based method, which examines the distribution of all mappings around a candidate breakpoint. If the distribution of mapped distances at a locus deviates from the expected insert-size distribution, an indel cluster is established. Although this method is better at detecting much smaller indels than the clustering-based method, it has other problems: for rare variants, and when distinguishing homozygous from heterozygous events, its detection power is not always reliable.
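A minimal sketch of the distribution-based idea, assuming a simple z-statistic on the mean insert size of all pairs spanning a locus; the 1.96 cutoff (roughly a 5% level) is an illustrative choice, not Lee et al.'s actual statistic:

```python
# Sketch of a distribution-based indel test: instead of flagging single pairs
# at a hard cutoff, compare the mean insert size of all pairs spanning a
# locus against the library mean with a z-statistic. The 1.96 threshold is
# an illustrative assumption.
import math

def shift_z(spanning_inserts, mean_insert, sd_insert):
    n = len(spanning_inserts)
    local_mean = sum(spanning_inserts) / n
    # Standard error of the mean shrinks with the number of spanning pairs,
    # so a consistent small shift becomes significant.
    return (local_mean - mean_insert) / (sd_insert / math.sqrt(n))

# Library: mean 400, s.d. 50. Each pair is only ~1 s.d. long (it would pass
# a 2-s.d. per-pair cutoff), yet jointly they reveal a ~50 bp deletion.
inserts = [452, 448, 455, 449, 451, 447, 453, 450]
print(shift_z(inserts, 400, 50) > 1.96)  # True
```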

2.3 Signatures Based on Depth of Coverage

Unlike the PEM signatures, only two cases arise with the depth-of-coverage (DOC) signature. One is copy-number duplication, in which the density of read fragments in some region of the donor is higher than in the ref genome; the other is copy-number deletion, in which the density in a region is lower than in the ref. These two cases are shown in Fig. 3.

Fig. 3.

The illustrations of DOC signatures [31]

2.4 Methods Using Read-Depth

Most read-depth methods partition the genome into non-overlapping windows and count the reads (tags) falling into each. The general procedure is to determine the regions whose tag counts differ notably from the normal counts across the genome. These strategies achieve admirable accuracy in detecting large CNVs. Although the sensitivity and specificity of these methods rise with the size of the CNV, they are customized only for dosage-changing SVs; in other words, they can detect CNVs and indels but not translocations or inversions.

There have been numerous investigations based on depth of coverage, and a number of tools have been developed (Table 2). For instance, Xie et al. [32] proposed a method that segments the genome into small fixed-size windows and then finds the windows of the case genome that differ notably from the reference genome. The resolution of these methods is related to the window size: a window that is too small weakens the detection power, while one that is too large loses resolution.
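A minimal sketch of window-based read-depth calling, assuming fixed non-overlapping windows and simple ratio thresholds against an expected per-window count; the window size and the gain/loss cutoffs are illustrative assumptions, not any published tool's values:

```python
# Sketch of window-based read-depth CNV calling: count read start positions
# in fixed, non-overlapping windows and flag windows whose count deviates
# strongly from the expected count. Thresholds are illustrative assumptions.
def call_cnv_windows(read_starts, genome_len, expected, window=100,
                     gain=1.5, loss=0.5):
    n_win = genome_len // window
    counts = [0] * n_win
    for pos in read_starts:
        if 0 <= pos < n_win * window:
            counts[pos // window] += 1
    calls = []
    for i, c in enumerate(counts):
        ratio = c / expected  # ~1.0 for two copies, ~1.5+ gain, ~0.5- loss
        if ratio >= gain:
            calls.append((i * window, (i + 1) * window, "gain"))
        elif ratio <= loss:
            calls.append((i * window, (i + 1) * window, "loss"))
    return calls

# 4 windows of 100 bp, 10 reads expected per window.
starts  = [i % 100 for i in range(10)]         # window 0: normal (10 reads)
starts += [100 + i % 100 for i in range(20)]   # window 1: duplicated (20)
starts += [200 + i % 100 for i in range(10)]   # window 2: normal (10)
starts += [300, 350]                           # window 3: deleted (2)
print(call_cnv_windows(starts, 400, expected=10))
# [(100, 200, 'gain'), (300, 400, 'loss')]
```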

Table 2. Tools of read-depth-based method [17]

A deficiency of these methods is that the factors producing an abnormal tag count are uncertain. For instance, the sequencing error of NGS varies with base composition: regions that are GC-poor or GC-rich are covered less than regions of average GC content, which may cause a spurious loss or gain of reads; moreover, a read mapped by mistake makes the discovery of DOC signatures more complicated.
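A common mitigation for the GC effect mentioned above (not taken from any tool cited here) is to rescale each window's count by the median count of windows with similar GC content; a sketch under that assumption:

```python
# Sketch of a common GC-bias correction: rescale each window's count by the
# ratio of the global median to the median of windows in the same GC bin,
# so GC-poor/GC-rich windows are not mistaken for losses or gains. The bin
# width is an illustrative assumption.
import statistics
from collections import defaultdict

def gc_correct(counts, gc_fracs, bin_width=0.05):
    bins = defaultdict(list)
    for c, gc in zip(counts, gc_fracs):
        bins[round(gc / bin_width)].append(c)
    global_med = statistics.median(counts)
    bin_med = {b: statistics.median(v) for b, v in bins.items()}
    return [c * global_med / bin_med[round(gc / bin_width)]
            for c, gc in zip(counts, gc_fracs)]

# Two GC-rich windows (count 10) and two GC-poor windows (count 5): after
# correction all four sit at the global median, removing the GC trend.
print(gc_correct([10, 10, 5, 5], [0.5, 0.5, 0.2, 0.2]))
# [7.5, 7.5, 7.5, 7.5]
```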

2.5 Methods Using Sequence Assembly

As the signatures become more and more complicated, people are attaching greater importance to local de novo assembly. Local de novo assembly has developed rapidly; just as its name implies, it finds the local regions that differ from the reference genome and cuts them out to be reassembled from the set of reads. Compared with de novo assembly using all the reads, local de novo methods greatly reduce the computational time. There are also several types of assembly signatures, illustrated in the following (Fig. 4).

Fig. 4.

The illustrations of assembly signature [31]

Consider a case in which a large insertion lies in the donor genome, so that few bases of a spanning read match the reference, or in which a large deletion occurs in the donor genome; the mapping signature is then more complicated, as described for the PEM signatures. As pointed out above, PEM-based methods have difficulty detecting these cases, since few bases match and there is not enough evidence for detection. Although some later 'soft-clip' extensions of PEM-based methods can tackle this problem, the results are inefficient, so an alternative way to recover these events is needed. With the development of next-generation sequencing, reassembly methods for variation detection are emerging as a popular alternative, such as micro-assembly methods. Their main idea is to perform localized de novo assembly: detect a region encompassing a potential SV, assemble it, and finally remap the contig created from the read set back to the reference genome. Table 3 shows recently developed tools based on local de novo assembly. In this table, the symbol '*' means that there is no published literature about the tool, but more detail can be found in the guide section of its website.

Table 3. Tools of the local de novo method based on de Bruijn graph [39]

These methods are roughly similar to each other except in the way they handle cycles in the graph. For example, Scalpel achieves high accuracy in repeat regions by utilizing a self-tuning k-mer size, and its deeper analysis of repeat-rich regions avoids generating cyclic paths. GATK HaplotypeCaller is akin to Scalpel in improving the accuracy of indel detection by gradually enlarging the k-mer size, but its accuracy is weaker than Scalpel's because it ignores approximately matching repeat sequences. SOAPindel uses another strategy, forming acyclic paths from unused reads by reducing the k-mer size. The k-mer size of TIGRA can be specified by the user, and this tool is designed only for breakpoint detection rather than for finding repeat regions. ABRA uses the same technique as Scalpel, increasing the k-mer size to generate a repeat-free path until the k-mer reaches an upper bound, but the scope of its assembly is no more than 2 kb. The scope Platypus can assemble is 1.5 kb, smaller than ABRA's. Different from the above-mentioned methods, Bubbleparse adopts the Cortex framework to implement indel detection, but its high false-positive rate is not satisfactory.
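The self-tuning k-mer idea can be caricatured as follows: enlarge k until the local de Bruijn graph contains no cycle, so repeats shorter than k no longer collapse onto themselves. This is a toy sketch of the principle, not Scalpel's or ABRA's actual implementation.

```python
# Toy self-tuning k-mer selection: build a de Bruijn graph from the local
# reads and, if repeats induce a cycle, retry with a larger k until the
# graph is acyclic. The k range is an illustrative assumption.
from collections import defaultdict

def de_bruijn(reads, k):
    graph = defaultdict(set)
    for r in reads:
        for i in range(len(r) - k):
            graph[r[i:i + k]].add(r[i + 1:i + k + 1])  # k-mer -> next k-mer
    return graph

def has_cycle(graph):
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def dfs(u):  # a GREY node reached again means a back edge, i.e. a cycle
        color[u] = GREY
        for v in graph[u]:
            if color[v] == GREY or (color[v] == WHITE and dfs(v)):
                return True
        color[u] = BLACK
        return False
    return any(color[u] == WHITE and dfs(u) for u in list(graph))

def pick_k(reads, k_min=3, k_max=21):
    for k in range(k_min, k_max + 1, 2):
        if not has_cycle(de_bruijn(reads, k)):
            return k
    return None

# "ACGACGACGT" repeats "ACG": small k yields a cycle, a larger k does not.
print(pick_k(["ACGACGACGT"]))  # 7
```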

3 Discussion

Recent studies have shown that SVs are as prevalent as SNPs in the genome. SV detection has become a hot area for biomedical researchers, and the precise identification of SVs will accelerate research on mechanisms related to human genetics and complex diseases. It is virtually certain that new algorithms and experimental schemes for detecting SVs will continue to arise in the future. NGS-based methods have provided numerous opportunities for mutation detection. Although the methods described above have their strengths, each also has its own scope of application. For instance, read-depth methods achieve good accuracy in detecting CNVs and indels, but their power to detect dosage-invariant mutations is poor. PEM-based methods may have difficulty locating the precise position of a breakpoint; furthermore, their performance depends on the completeness of the pairs-read. For example, if a large insertion or deletion emerges around a breakpoint and few bases match, the accuracy of SV detection decreases. The PEM-based method and the read-depth method have their respective strengths, so combining them may yield more satisfactory results. We can therefore consider fusing multiple methods or strategies, because a single method or strategy is too plain for detecting composite variations, and fusion can exploit more information. Some tools already use two or more strategies comprehensively: the algorithm in BreakDancer combines the clustering-based and distribution-based strategies, and SVseq fuses the PEM-based and split-read methods. Integrating various detection algorithms is becoming a popular direction for SV detection.

Another difficulty of SV detection is that the optimal parameter values are hard to ascertain. For instance, the PEM-based method has two parameters: the number of standard deviations used to decide whether a pairs-read is discordant, and the minimum number of pairs-read in a cluster. The first parameter is tied to the mean insert distance, is usually fixed, and relies too heavily on experience. It is vital to make such parameters self-tuning through some adaptive technique.

With the development of sequencing techniques, structural variation algorithms face the problem of adapting to the characteristics of newly sequenced data. Although the short reads generated by NGS technology can be well utilized for detecting SVs, some large insertion or deletion events also occur in the genome. Moreover, other technologies, such as Sanger and Roche platforms, generate long-read data. Using long reads is becoming a feasible strategy on many platforms, so related research is also required.