Background

From the development of high-throughput sequencing (HTS) technologies, the detailed variant detection is enabled for each individual with whole genome sequencing analysis. For single nucleotide variants (SNVs), various types of variant calling methods have been proposed [14] for HTS data, and the accurate SNV detection is archived for more than a thousand of individuals in genome-wide scale [5, 6]. However, unlike SNVs, there still exists difficulty in the accurate detection of structural variations such as genome insertion, genome deletion, short tandem repeat (STR) number polymorphisms, and copy number variations, especially from data with low read coverage [7].

For repeat number polymorphisms, several studies thus far reported associations with various disease phenotypes such as CAG repeats in the Huntingtin gene with Huntington’s disease [8] and CAG repeats in the androgen receptor gene with spinal and bulbar muscular atrophy [9]. From HTS data, several approaches such as lobSTR [10], RepeatSeq [11], STRViper [12], and coalescentSTR [13] have been proposed for estimating repeat numbers in STR regions. In lobSTR and RepeatSeq, repeat patterns included in sequence reads spanning the STR regions are considered for the estimation of repeat numbers. On the other hand, STRViper and coalescentSTR estimate repeat numbers by considering difference between the actual insert size and the insert size inferred from paired-end reads aligned to the flanking regions of the target repeat. The alignment quality of sequence reads is important for accurate repeat number estimation, especially in the former approaches although usual alignment methods have difficulty in STR regions due to insertions and deletions caused by the frequent change of repeat numbers.

We propose a new dynamic programming based realignment method named STR-realigner where repeat patterns in STR regions are given as prior knowledge, and repeat patterns are used multiple times in the realignment process. Although a similar algorithm is adopted in a tool for detecting STR regions in PacBio reads based on 3-stage modified Smith-Waterman [14], consecutive STR regions can be handled in the proposed algorithm unlike the tool. In addition, clipping fragments, which are an essential feature for the realignment, are also considered in the proposed algorithm. By allowing insertions and deletions of repeat patterns in STR regions with repeatedly use of repeat units, accurate realignment of sequence reads is expected.

In a simulation study with synthetically generated HTS data for artificial diploid genomes sequence based on phased genotypes of a sample in the dataset of 1000 Genomes Project [5], we showed the effectiveness of our model by evaluating root mean squared errors between true and estimated repeat numbers with RepeatSeq or allelotype, an STR calling software in the lobSTR package, from realignment results. For real data analysis, we applied STR-realigner, ReviSTER [15], and GATK IndelRealigner to HTS data from Illumina HiSeq 2000 for a HapMap CEU parent-offspring trio and show the effectiveness of STR-realigner based on consistency in Mendelian inheritance in the estimated repeat numbers in the parent-offspring trio. Root mean squared error values were also calculated from the comparison with the gold standard STR region size obtained from high coverage PacBio sequencing data for one of samples in the parent-offspring trio, and the results from the realigned sequencing data with STR-realigner showed the least (the best) root mean squared error value.

Method

Realignment algorithm considering repeat sequence as prior knowledge

We propose a dynamic programming based algorithm named STR-realigner that realigns query read R to a genome sequence, taking into account the multiple use of repeat patterns for prespecified STR regions. We consider a genome sequence comprised of series of m subsequences G 1,…,G m . Let B j be a binary variable that takes one if G j can be used repeatedly and zero otherwise, i.e., subsequence G j with B j =1 is for a repeat pattern in one of prespecified STR regions. Figure 1 shows an example of a genome sequence comprised of subsequences G 1,…,G 6, where G 2, G 3, and G 5 are repeat patterns of prespecified STR regions and are repeatedly used in the proposed realignment algorithm. In the description of the proposed algorithm, |R| and |G j | denote the size of R and G j , and R[ k] and G j [ k] denote bases at the kth position of R and G j , respectively.

Fig. 1
figure 1

An example of notations in STR-realigner. Subsequence with B j =1 can be used repeatedly in the realignment process

Since infinitely long deletions can be considered by using the same subsequence with B j =1 repeatedly, we limit the size of deletions to less than |G j | for subsequences with B j =1. We consider the following six types of states for the alignment of the ith position in query read R to the kth position of subsequence G j .

  1. 1.

    s M (i,j,k): a state representing match or mismatch between bases at the ith position of query read R and the kth position of subsequence G j .

  2. 2.

    s I (i,j,k): a state representing insertion at the ith position of query read R right after the kth position of subsequence G j .

  3. 3.

    s D (i,j,k): a state representing deletion of the kth position of subsequence G j right after the ith position of query read R.

  4. 4.

    s D (i,j,k,l): a state representing deletion from the kl+1 to kth positions of subsequence G j right after the ith position of query read R. This state is considered only for subsequences with B j =1 in order to avoid deletions longer than |G j | by limiting the range of l from 2 to |G j |−1. For l=1, s D (i,j,k) is used, and consecutive deletions in the same subsequence are not considered for s D (i,j,k) with B j =1. If l is longer than k, the deletion starts from the |G j |−lk+1st position on the subsequence and the deletion part rotates from tail to head of the subsequence.

  5. 5.

    s L (i): a state representing left clipping that ends at the ith position of query read R.

  6. 6.

    s R (i): a state representing right clipping that starts at the ith position of query read R.

The following penalties are considered in the proposed realignment algorithm.

  • p m,j : penalty for match of bases between query read R and subsequence G j . Usually, the penalty is set to a minus value, i.e., the penalty is used for rewarding.

  • p mis,j : penalty for mismatch of bases between query read R and subsequence G j .

  • p io,j and p ie,j : penalties for open and extension of insertion on subsequence G j , respectively.

  • p do,j and p de,j : penalties for open and extension of deletion on subsequence G j , respectively.

  • p c : penalty for clipping.

In the proposed dynamic programming algorithm, penalty and traceback information for state s are stored in functions P(s) and T(s), respectively. In the first step of the dynamic programming, penalty and traceback information of states for the first position in query read R are initialized in the following algorithm.

The best penalties for the alignment up to the ith position of query read R for each state is updated by using the best penalties of states for the i−1st position of query read R in Algorithm 2, where traceback information is also updated. Algorithm 3 given below updates penalty and traceback information for states representing match or mismatch. Algorithm 4 given below is used for obtaining states that are in preceding subsequences and can be traced from s M (i,j,1). Algorithm 5 given below updates penalty and traceback information for states representing insertion. Algorithm 6 given below updates penalty and traceback information for states representing deletion.

For subsequence G j with B j =1, consecutive deletions in the same subsequence are handled with s D (i,j,k,l), and hence s D (i−1,j,k−1) is not considered at step 6 of Algorithm 6 for traceback. Procedures for updating penalty and traceback information for states representing consecutive deletions for subsequence G j with B j =1 is given as Algorithm 7. Algorithm 8 given below updates penalty and traceback information for s R (i). Finally, an algorithm for traceback is given as Algorithm 9. By following states from head to tail in Q obtained with the above algorithm, the realignment result with the best penalty is obtained.

Figure 2 summarize a relationship of the above nine algorithms considered in STR-realigner as a flowchart. After initialization of penalty and traceback information for first query position with Algorithm 1, penalty and traceback information are updated for other query positions with Algorihtm 2 in a dynamic programming manner. Then, a realignment with the best penalty is obtained from traceback information with Algorithm 9.

Fig. 2
figure 2

A flowchart of algorithms considered in STR-realigner. After initialization of penalty and traceback information for first query position with Algorithm 1, penalty and traceback information are updated for other query positions with Algorihtm 2 in a dynamic programming manner. Then, a realignment with the best penalty is obtained from traceback information with Algorithm 9

Time and space complexities of STR-realigner

Time complexity analysis

For each position i in query read R, updating penalty and traceback information takes O(1) time for s M (i,j,k), s I (i,j,k), and s D (i,j,k) for k>1 and subsequence G j with B j =0. For k>1 and subsequence G j with B j =1, updating information for s M (i,j,k) and s I (i,j,k) requires O(|G j |) time while updating information for s D (i,j,k) and s D (i,j,k,l) requires O(1) time. For k=1, states for tail positions of preceding subsequences are additionally considered until reaching to subsequence G j with B j =0 or j=1 as in Algorithm 4. This process additionally requires \(O\left (\sum _{x={j'}}^{j} |G_{x}|\right)\) time for s M (i,j,1), where j is one or the index for the first subsequence Gj′ with Bj′=0 reached from G j . However, since the best state and its corresponding penalty before G j−1 are already considered for updating information for s M (i,j−1,1), by using this information, we need to newly consider only states in subsequence G j−1, and hence the additionally required time complexity is reduced to O(|G j−1|). Thus, with the modification of the algorithm according to the above argument, updating information for states s M (i,1,1),…,s M (i,m,1) requires \(O\left (\sum _{j} |G_{j}|\right)\) time in total. Since the same optimization can be applied to updating information for states representing insertion, updating information for states with k=1 requires \(O\left (\sum _{j} |G_{j}|\right)\) time in total as well. In addition, for s L (i) and s R (i), O(1) time and \(O\left (\sum _{j} |G_{j}|\right)\) time are required, respectively. Thus, updating penalties and traceback information for all the states requires \(O\left (\sum _{j} |G_{j}| + \sum _{j \in \{j'| B_{j'} = 1\}} |G_{j}|^{2}\right)\) time for each position in query read R, and hence the time complexity of the proposed algorithm is \(O\left (|R| \cdot \left (\sum _{j} |G_{j}| + \sum _{j \in \{j'| B_{j'} = 1\}} |G_{j}|^{2}\right)\right)\) time.

Space complexity analysis

The order of the number of states for each position in query read R is \(O\left (\sum _{j} |G_{j}|\right)\) for s M (i,j,k), s I (i,j,k), and s D (i,j,k). For s D (i,j,k,l), the order is \(O\left (\sum _{j \in \{j'| B_{j'} = 1\}} |G_{j}|^{2}\right)\), and for s L (i) and s R (i), the order is O(1). Thus, storing values from functions P and T requires \(O\left (|R| \cdot \left (\sum _{j} |G_{j}| + \sum _{j \in \{j'| B_{j'} = 1\}} |G_{j}|^{2}\right)\right)\) space. However, P(s D (i,j,k,l)) can be obtained by calculating P(s D (i,j,k))+(l−1)·p de,j , and T(s D (i,j,k,l)) is given by s D (i,j,k,l−1) for l>2 and s D (i,j,k) for l=2. Thus, the order of the space required for functions P and T can be reduced to \(O\left (|R| \cdot \left (\sum _{j} |G_{j}|\right)\right)\) by calculating functions P and T for s D (i,j,k,l) with O(1) time when their values are required. The space required for updating for each state is less than the order of the number of states and is negligible, compared to spaces required for P and T. Thus, with the above modification, the proposed algorithm requires \(O\left (|R| \cdot \left (\sum _{j} |G_{j}|\right)\right)\) space.

Practical implementation

Irregular repeat patterns are often contaminated in the provided STR regions detected by some Bioinformatics tools [16, 17], and those irregular repeat patterns worsen the quality of the alignment of the proposed algorithm due to the difference of the actual sequence and the assumed repeat pattern. In order to address this issue, we extract maximal regions containing repeat patterns consecutively with some pre-specified error rate from the target STR region. The extracted region is used for a new target STR region for STR-realigner.

In order to use the realignment result from the proposed algorithm for resequenced data, parts of the query read aligned to G j with B j =1 are again realigned to the corresponding STR region of the reference genome. However, the quality of the alignment is also worsened due to irregular patterns in the STR region. Thus, we consider a subsequence for a repeat pattern right after the target STR region and set lower deletion penalty to the target STR region. For penalty, the following setting were used in our study: p m,i =−1, p mis,i =4, p io,i =6, p ie,i =1, p do,i =6, p de,i =1, and p c =5. These parameter values are the same as the default values in BWA-MEM. For subsequences corresponding to target STR regions for lower deletion penalty, p do,i is set to 4.

In Illumina reads, bases at positions after homopolymer regions are highly erroneous because the same phasing is accumulated in synthesis during the Illumina sequencing process in homopolymer regions. Figure 3 shows an example of erroneous bases around a homopolymer region where a lot of clippings occur around a long homopolymer comprised of A bases in GRCh37 due to sequencing errors. Since sequence reads with such highly erroneous bases worsen the quality of realignment with STR-realigner, we additionally implemented an option that skips the realignment with STR-realigner for homopolymer regions with some specified size such as 15.

Fig. 3
figure 3

An alignment result around a homopolymer region. Most of the reads spanning the region contain soft clipping parts due to drastic sequencing errors after the homopolymer region

Each mapping tool has its specific characteristics in the aligned reads. For example, a deletion exists in the start position of an STR region in the reads aligned with some mapping tool while a deletion exists in the end position of the STR region in the alignment result of another mapping tool for the same sequence reads. The performance of variant calling is worsened if such characteristics are mixed in the alignment results. Thus, all the reads aligned to a target STR region are realigned with STR-realigner in the default condition.

Results and discussion

Simulation analysis

From a list of STR regions provided in the RepeatSeq software package, we extracted STR regions for evaluation as follows:

  • STR regions not in chromosome 22 were filtered out.

  • STR regions with size longer than 100 bp were filtered out.

The maximum period, the size of repeat pattern, in the list is six. Since the length of sequence reads considered in the following experiments is 100 or 101 bp and these sequence reads cannot span STR regions > 100 bp for most of the cases, STR regions > 100 bp were filtered out. We then prepared synthetically generated diploid genome sequences of chromosome 22 based on phased genotypes for a CEU individual, NA12286, in the phase3 phased reference panel by the 1000 Genomes Project [18]. In the generation of the above genome sequences, variants located in the extracted repeat regions were ignored. The number of variants in total is 54,897. By randomly sampling repeat numbers, we generated two sets of repeat numbers for the extracted repeat regions and added STR variants to the diploid genome sequences based on the sets of repeat numbers for the evaluation. Note that repeat numbers with which the size of STR region is > 100 bp were avoided in the random sampling process. From the diploid genome sequences, we generated paired-end sequence reads in FastQ format with the read length of 100 bp and the insert size normally distributed with mean of 500 bp and standard deviation of 50 bp. In the generated reads, substitution errors were added with rate of 0.1%. Base quality scores for bases in FastQ format were set to Q30, which corresponds to 0.1% error. The read coverage of the generated data is 40 ×. A BAM file for the dataset was obtained by mapping the sequence reads to the reference genome (GRCh37) with BWA-MEM (0.7.12-r1039) [19]. We applied our proposed realignment method, STR-realigner, ReviSTER (0.1.7), and GATK IndelRealigner (GATK 3.4-0) to the BAM independently and generated three types of BAM files.

For GATK IndelRealigner, USE_READS was used for --consensusDeterminationModel option. RepeatSeq (v0.8.2) was applied to the original BAM file and the three types of realigned BAM files, and sizes of variants in the target STR regions were obtained. Table 1 shows call rates of results from RepeatSeq using the original BAM file and the three types of realigned BAM files. The call rate indicates the rate of results with STR region size estimated as a non-NA value. For all the STR periods other than period of 1, call rates of results from the BAM file realigned with STR-realigner are higher than those from other BAM files.

Table 1 Call rate of STR calling results with RepeatSeq using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. The best result is underlined

Table 2 summarizes the root mean squared errors (RMSE) between estimated and true STR region size for each BAM file for all the STR regions. In the calculation of RMSE, the size in the reference genome was assigned for the region size estimated as NA value. For all the STR periods other than period of 1, the results from the BAM file realigned with STR-realigner show the best RMSE value. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file. In order to examine the performance excluding the results estimated as NA value, we summarized RMSE for STR regions where results were commonly estimated as a non-NA value on all the four types of BAM files in Table 3. Similarly to the results for all the STR regions, the results from the BAM file realigned with STR-realigner show the best RMSE value for all the STR periods other than period of 1. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file. We also applied allelotype (4.0.0) [10], an STR calling software in lobSTR package, to the original BAM file and the three types of realigned BAM files. Tables 4, 5 and 6 show call rates, RMSE values averaged on all the regions, and RMSE values averaged on commonly called regions for results from allelotype, respectively. In these tables, the results for the original BAM file realigned by allelotype with --realign option are also included. The results for the BAM files realigned with STR-realigner and ReviSTER gave the highest call rate for all the STR periods other than period of 4 and the case considering all the periods. The results for STR-realigner gave the best RMSE value for STR periods of 2, 4, and 6 and the case considering all the periods for commonly called regions although the results for STR-realigner are slightly worse than those for ReviSTER in total for Table 5.

Table 2 Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions. The best result is underlined
Table 3 Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for commonly called STR regions. The best result is underlined
Table 4 Call rate of STR calling results with allelotype using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option. The best result is underlined
Table 5 Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions. The best result is underlined
Table 6 Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 40 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for commonly called STR regions. The best result is underlined

In order to examine the performance on lower coverage data, we downsampled the original BAM file from 40 × to 10 × and estimated repeat numbers with RepeatSeq and allelotype using the downsampled BAM file and BAM files obtained by applying the realignment methods to the downsampled BAM file. Table 7 shows call rates of results from RepeatSeq for the downsampled BAM files. For all the STR periods other than period of 1, call rates of results from the BAM file realigned with STR-realigner are higher than those from other BAM files. Tables 8 and 9 respectively summarize RMSE values for each BAM file for all the STR regions and the regions where results were commonly estimated as a non-NA value. In the calculation of RMSE, the size in the reference genome was set for the region size estimated as NA value for Table 8. In both Tables 8 and 9, the results from the BAM file realigned with STR-realigner shows the best RMSE values for all the STR periods other than period of 1. The RMSE value from the results based on the BAM file realigned with GATK IndelRealigner is slightly better than that based on the original BAM file.

Table 7 Call rate of STR calling results with RepeatSeq using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner. The best result is underlined
Table 8 Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions. The best result is underlined
Table 9 Root mean squared error (RMSE) between true and estimated repeat numbers with RepeatSeq using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for commonly called STR regions. The best result is underlined

Tables 10, 11 and 12 show call rates, RMSE values averaged on all the regions, and RMSE values averaged on commonly called regions for results from allelotype, respectively. The results for the BAM file realigned with STR-realigner gave the highest call rate for all the STR periods other than period of 1. In total, STR-realigner gave the best results in both considering all the STR regions and commonly called regions.

Table 10 Call rate of STR calling results with allelotype using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option. The best result is underlined
Table 11 Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions. The best result is underlined
Table 12 Root mean squared error (RMSE) between true and estimated repeat numbers with allelotype using the original BAM file of 10 × and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for commonly called STR regions. The best result is underlined

The results for the sequencing data of 10 × are always worse than those of 40 × in all the cases.

Real data analysis

For real human sequencing data, we used 101 bp paired-end sequencing data of a CEU parent-offspring trio NA12878, NA12891, and NA12892 analyzed in the 1000 Genomes Project. NA12891 and NA12892 are parents of NA12878. The data was sequenced on Illumina HiSeq 2000 with the read coverage of 50 × and the average insert size of 300 bp. Sequence reads were mapped to the reference genome (GRCh37) with BWA-MEM and stored in BAM format. The data was obtained from the Illumina Platinum Genomes Project through the European Nucleotide Archive under the study accession PRJEB3381 (http://www.ebi.ac.uk/ena/data/view/ERP001960). STR-realigner, ReviSTER, and GATK IndelRealigner were applied to these BAM files. RepeatSeq was then applied to the original BAM files and these realigned BAM files and sizes of the target STR regions were estimated. In order to examine the performance, we considered the consistency in Mendelian inheritance in called regions, where the estimated region size for NA12878 is a non-NA value. STR regions used for the evaluation are the same as the regions in simulation analysis in Section 1. We counted STR regions where the estimated size for NA12878 consistent with those for her parents, NA12891 and NA12892 in terms of Mendelian inheritance as well as the STR regions with inconsistent results.

In Table 13, the number of regions with consistent sizes in terms of Mendelian inheritance (#CR) and the number of inconsistent estimation results (#IR) are summarized. Note that the larger number is better for consistent regions while the smaller number is better for inconsistent regions. The results for STR period of 1 without skipping homopolymer regions with size >15 are in parentheses.

Table 13 The numbers of estimated repeat numbers matched and mismatched with parents in terms of Mendelian inheritance

For STR periods of 3, 4, and 6, results from BAM files realigned with STR-realigner gave the best results in both consistent and inconsistent regions.

In total, results from BAM files realigned with STR-realigner gave the best results in both consistent and inconsistent regions. Results for GATK IndelRealigner are consistent in more regions than those for the original BAM files although the results for GATK IndelRealigner contains the most inconsistent regions.

In Table 14, the number of regions with consistent sizes in terms of Mendelian inheritance (#CR) and the number of inconsistent sizes (#IR) on results estimated with allelotype are summarized. For STR periods of 2, 3, 5, and 6, results from BAM files realigned with STR-realigner gave the best results in consistent regions. In addition, for STR periods of 2 and 5, results from BAM files realigned with STR-realigner also gave the best results in inconsistent regions. In total, the results for STR-realigner gave the best performance in both consistent and inconsistent regions.

Table 14 The numbers of estimated repeat numbers matched and mismatched with parents in terms of Mendelian inheritance

Figure 4 shows an IGV view where STR-realigner effectively works on realigning inserted repeat patterns in an STR region comprised of GGAT repeats located at chr22:28045335-28045407 for a BAM file for NA12892. In the original BAM file, there exist clipping fragments in some reads due to the insertion of GGAT repeats, and the insertion of GGA repeats was missed with RepeatSeq, which caused the inconsistency of the estimated results in Mendelian inheritance. On the other hand, sequence reads with left clipping disappear and insertions are observed in the BAM file realigned with STR-realigner. In the estimated result with RepeatSeq, the insertion was included in the estimated size of the STR region.

Fig. 4
figure 4

Comparison of original and realigned BAM files for NA12892 in an STR region located at chr22:28045335-28045407. The top panel is a plot of sequencing data in a BAM file realigned with STR-realigner and the bottom one is a plot of sequencing data in the original BAM file

We also evaluated root mean squared errors for NA12878 by using STR region sizes obtained from high coverage PacBio sequencing data as the gold standard. The PacBio sequencing data was obtained through BioProject PRJNA253696 with Sequence Read Archive accession numbers SRX627421 and SRX638310 [20] and its read coverage is 46 × in total. Error-corrected reads with Falcon (https://github.com/PacificBiosciences/FALCON) in FASTA format were aligned to GRCh37 with BWA-MEM, and STR region size data was then obtained from variant calling results by applying allelotype to the aligned reads.

Table 15 summarizes the RMSE values between estimated STR region size with RepeatSeq and the gold standard for each BAM file for all the STR regions. In the calculation of RMSE, the size in the reference genome was assigned for the region size estimated as NA value. RMSE values for estimated results with allelotype are also summarized in Table 16. In both cases using RepeatSeq and allelotype, the results from sequencing data realigned with STR-realigner showed the best performance in total.

Table 15 Root mean squared error (RMSE) between the gold standard and estimated STR sizes with RepeatSeq using the original BAM file for NA12878 from HiSeq 2000 and those realigned with STR-realigner, ReviSTER, and GATK IndelRealigner for all the STR regions
Table 16 Root mean squared error (RMSE) between the gold standard and estimated STR sizes with allelotype using the original BAM file for NA12878 from HiSeq 2000 and those realigned with STR-realigner, ReviSTER, GATK IndelRealigner, and allelotype with --realign option for all the STR regions

Comparison of computational time

Table 17 shows the computational time of STR-realigner, ReviSTER, and GATK IndelRealigner for simulation and real data analyzed in Sections 1 and 1. For the real data, the computational time for realigning the BAM file for NA12878 was measured. Computation was conducted in Intel Xeon CPU E5-2670 processors with a single thread, and computational time for each case is for single process. STR-realigner is implemented in Java. In both simulation and real data, ReviSTER required the most computational time among these methods, and STR-realigner required more computational time than GATK IndelRealigner. For STR-realigner, by limiting alignment space within some window of the original alignment result, the drastic reduction of the computational time is expected while keeping its realignment quality. In addition, the computational time can be reduced by realigning sequence reads for each STR region in a parallel manner. For memory consumption, STR-realigner requires less than 2GB in both simulation and real data.

Table 17 Comparison of computational time on a simulation data for an individual with read coverage of 40 × and a real dataset for NA12878

Conclusion

We proposed a new realignment method for STR regions named STR-realigner that takes sequence reads aligned with other methods and realigns sequence reads by dynamic programing manner with the consideration of the corresponding STR repeat pattern as prior knowledge. For the simulation data analysis, we prepared synthetically generated reads aligned with BWA-MEM, those realigned with STR-realigner, those realigned with ReviSTER, and those realigned with GATK IndelRealigner. In order to evaluate the effectiveness of our proposed method, we applied RepeatSeq and allelotype to these four types of aligned reads, and the results from sequence reads realigned with the proposed method give the best RMSE value among the results from these four types of aligned reads. From the comparison of root mean squared errors between estimated and true STR region size for these four types of aligned reads, the results for the dataset realigned with STR-realigner are better than those for other datasets for most of the cases. For real data analysis, we considered a real sequencing dataset from Illumina HiSeq 2000 for a parent-offspring trio, RepeatSeq was applied to an aligned sequencing dataset with BWA-MEM, that realigned with STR-realigner, that realigned with ReviSTER, and that realigned with GATK IndelRealigner, and the results from the dataset realigned with STR-realigner shows the best performance in terms of consistency of the size of estimated STR regions in Mendelian inheritance. In addition, by using the size for STR regions obtained from high coverage PacBio sequencing data as the gold standard, the results for STR-realigner show the best RMSE values for the case considering all the periods. In both simulation and real data, ReviSTER required the most computational time among the realignment methods considered in this work, and the proposed method required more computational time than GATK IndelRealigner. However, the computational time of STR-realigner can be reduced drastically by parallel computing and limiting the search space for the realignment around the originally aligned results in some specified window size.