Background

Microsatellites, or simple sequence repeats (SSRs), are tandemly repeated DNA sequences with a period of 1 to 6 base pairs [1]. They are sometimes referred to as variable number of tandem repeats (VNTRs). An SSR that contains one type of repeat is called a simple SSR (e.g. (CA)15), and one that contains more than one type is called a compound SSR (e.g. (CA)8(CG)12) [2]. The repeat units are generally di-, tri-, tetra-, or pentanucleotides. SSRs are commonly found in non-coding regions of the genome.
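The compact notation used above, such as (CA)15 or (CA)8(CG)12, can be expanded mechanically into a nucleotide string. The following Python sketch is only an illustration of that notation; the function names `expand_ssr` and `is_compound` are hypothetical and are not part of SALT.

```python
import re

# Matches one "(MOTIF)COUNT" element of the notation, e.g. "(CA)8".
_ELEMENT = re.compile(r"\(([ACGT]+)\)(\d+)")

def expand_ssr(notation):
    """Expand notation such as "(CA)8(CG)12" into the repeated sequence."""
    return "".join(motif * int(count)
                   for motif, count in _ELEMENT.findall(notation))

def is_compound(notation):
    """Return True when the notation contains more than one repeat type."""
    motifs = {motif for motif, _ in _ELEMENT.findall(notation)}
    return len(motifs) > 1
```

For example, `expand_ssr("(CA)2(CG)2")` gives the string CACACGCG, which `is_compound` classifies as a compound SSR.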

SSRs are highly mutable loci [3]. In animals, observed SSR mutation rates have been of the order of 10^-3 to 10^-4 for autosomal repeat loci [4, 5] (Weissenbach et al. 1992; Weber and Wong 1993). However, the average mutation rate at SSR loci has also been reported as 10^-2 per generation [6].

Chistiakov et al. [7] suggested that two mechanisms are responsible for the high mutability of SSRs. First, motif repetition makes SSRs prone to DNA polymerase slippage during replication because of the multiple complementary sequences; second, unequal crossing over or related processes [8-11]. The slippage rate is correlated with SSR length, which makes longer SSRs more variable than shorter ones [12, 13]. However, there is no threshold length for slippage mutations [14]. Mutations caused by polymerase slippage can be considered a special type of insertion/deletion (INDEL) mutation, in which sequence is added or removed without any substitution. Substitutions constitute another kind of mutation, single nucleotide polymorphisms (SNPs). In general, SNPs occur much more frequently than INDELs [15]. However, SSR replication slippage generates more genetic change per generation in eukaryotes than all base substitutions [16], so it increases the frequency of INDELs. In addition, perfect SSR motifs have been reported to be significantly more variable than imperfect repeat motifs [17, 18].

The power of SSR regions relies on their high abundance in the genome, codominant nature, extensive genome coverage, and high polymorphism [19]. The polymorphism of an SSR depends on differences in the number of repeated units between alleles at a single locus. SSRs are used as molecular markers in a wide range of applications, such as genome mapping, marker-assisted selection, gene tagging, and evolutionary and diversity studies [20]. The main feature that makes SSRs amenable to use as molecular markers is that their flanking regions are highly conserved, allowing specific PCR primers to amplify the same SSR even across different taxa [21, 22].

Sequence alignment involves identifying the correct locations of the INDELs that have occurred since the sequences diverged from a common ancestor. A true alignment accurately reflects the evolutionary relationships between the sequences. In a compound SSR region, however, a general alignment shows many overlaps between the different repeat units, which seem biologically incorrect given the replication slippage mutation rate. This suggests a need to re-evaluate the general alignment methods and their parameters. In this paper, we surmise that a correct alignment should place the repeats separately, without overlaps between them and without changing the alignment parameters. We suggest incorporating a simple algorithm that shifts the SSR loci after the usual alignment has been performed in regular software.

Findings

Algorithm

In this paper, we compare our new algorithm for SSR alignment with the common alignment algorithms used in other programs. The new algorithm (Figure 1) handles the SSR according to the following major steps:

Figure 1. The SSR aligning algorithm.

1- The user must identify the following items:

  a. Data set file
  b. Repeated units
  c. SSR length (first and last nucleotide)

2- Identify the sequences that do not match the first repeated unit from the beginning of the selected SSR region

3- Do this for each repeated unit:

  a. Put the tandem repeat in a temporary array
  b. Check if the next nucleotides match the next repeated unit
  c. If not, put the unmatched nucleotides in another temporary array
  d. Fill the gaps to the longest sequence of the repeats in the same array
  e. Merge the temporary arrays

4- Replace the SSR region with the results.

See Additional file 1: SALT.swf, an animation that describes the algorithm.
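The steps listed above can be sketched in Python as follows. This is a minimal illustration written for this text, not the PERL code distributed with SALT. It assumes that the repeated units occur in the order given by the user, that runs of a repeated unit are padded with gaps on the right, and that unmatched nucleotides are placed after the gap and immediately before the next repeated unit. The names `split_region` and `shift_ssr` are hypothetical.

```python
def split_region(seq, units):
    """Split one ungapped SSR region into alternating blocks:
    [unmatched, run of unit 1, unmatched, run of unit 2, ..., trailing]."""
    blocks, pos = [], 0
    for unit in units:
        start = seq.find(unit, pos)
        if start == -1:              # unit absent: keep two empty blocks
            blocks += ["", ""]
            continue
        end = start
        while seq.startswith(unit, end):   # extend over the tandem run
            end += len(unit)
        blocks += [seq[pos:start], seq[start:end]]
        pos = end
    blocks.append(seq[pos:])         # nucleotides after the last run
    return blocks

def shift_ssr(aligned, units, first, last):
    """aligned maps names to equal-length aligned sequences;
    first and last are 1-based, inclusive positions of the SSR region."""
    split = {name: split_region(seq[first - 1:last].replace("-", ""), units)
             for name, seq in aligned.items()}
    n_blocks = 2 * len(units) + 1
    widths = [max(len(b[i]) for b in split.values()) for i in range(n_blocks)]
    result = {}
    for name, seq in aligned.items():
        cols = []
        for i, block in enumerate(split[name]):
            if i % 2:                # odd index: repeat run, gaps on the right
                cols.append(block.ljust(widths[i], "-"))
            else:                    # even index: unmatched, gaps on the left
                cols.append(block.rjust(widths[i], "-"))
        result[name] = seq[:first - 1] + "".join(cols) + seq[last:]
    return result
```

With two hypothetical alleles GGTATATACACACC and GGTATACACACACC, the units TA and CA, and an SSR region from position 3 to 12, the sketch returns GGTATATACACA--CC and GGTATA--CACACACC, so the TA and CA repeats no longer overlap and the alignment grows by two columns.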

Testing and Implementation

The case A sequence contained a simple SSR with the tandem repeat TA, which represents 15.4% of the whole sequence. After aligning with MEGA 4 and then applying our modification, one major difference appeared clearly in the gap sites of some sequences (Figure 2). However, these differences did not change the phylogenic tree before and after applying the new algorithm, and the whole sequence length was 351 bp in both cases (Figure 3).

Figure 2. Part of the case A sequence which contains the SSR region before (A) and after (B) applying the algorithm.

Figure 3. Case A phylogenic tree before (A) and after (B) applying the algorithm.

The case B sequence contained a compound SSR with the tandem repeats TA and CA, which represent 25.2% of the whole sequence. The length increased from 397 bp to 413 bp after applying the new algorithm. However, the phylogenic trees indicated that 50% of the samples clustered similarly before and after the new algorithm was applied (Figure 4).

Figure 4. Case B phylogenic tree before (A) and after (B) applying the algorithm.

The case C sequence contained a compound SSR of TA, CA, and CG tandem repeats representing 35% of the whole sequence. Applying our new algorithm to case C increased the length of the sequence from 457 bp to 478 bp. However, the comparison of the phylogenic trees before and after applying the new algorithm showed that only seven samples, 26.9% of the 26 samples, clustered similarly (Figure 5).

Figure 5. Case C phylogenic tree before (A) and after (B) applying the algorithm.

The case D sequence contained a compound SSR (TA, CA, CG, and TG). These tandem repeats represent 38% of the whole sequence. The whole sequence length changed from 479 bp to 539 bp after the new algorithm was applied. The cluster analysis resulted in completely different phylogenic trees before and after applying the new algorithm (Figure 6).

Figure 6. Case D phylogenic tree before (A) and after (B) applying the algorithm.

The overall pairwise value (PV) for cases A, B, C, and D before applying the new algorithm increased as the sequence contained more repeated units (Figure 7). In contrast, the PV decreased after the new algorithm was applied to the same sequences. The new algorithm gave a more stable distance by preventing the overlaps between different lineages, although the PV decreased slightly, which may be attributed to the additional aligned repeated units. The additional units increased the SSR length and gave more similarity, because the region contains no overlaps or mismatches and the only difference between alleles is the position of the opening gap. The difference between the two PVs (before and after applying the new algorithm) increased across cases A, B, C, and D, indicating that the general alignment methods yielded greater genetic distances.
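The effect of overlapping repeats on the overall pairwise value can be illustrated with a small Python sketch. The distance model used in MEGA 4 for the PV is not stated here, so the sketch assumes a simple p-distance (the proportion of differing sites) with pairwise deletion of gapped sites; the names `p_distance` and `overall_mean_distance` are hypothetical.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences,
    ignoring sites where either sequence has a gap (assumed model)."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x != y for x, y in pairs) / len(pairs)

def overall_mean_distance(seqs):
    """Mean p-distance over all pairs of aligned sequences."""
    dists = [p_distance(a, b) for a, b in combinations(seqs, 2)]
    return sum(dists) / len(dists)
```

For two hypothetical alleles aligned without gaps, GGTATATACACACC and GGTATACACACACC, one of 14 sites differs because a TA unit is aligned against a CA unit. When the repeats are separated (GGTATATACACA--CC and GGTATA--CACACACC), no compared site differs and the distance under this assumed model falls to zero.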

Figure 7. The overall pairwise distance differences in cases A, B, C and D.

Case E showed a compound-imperfect SSR with the tandem repeats GAA, GAT, and GAGGAT, respectively. This imperfect SSR represents 9.4% of the sequence tested in case E. The alignment showed clear differences before and after the SSR region was treated with the new algorithm (Figure 8). Despite the small percentage of this SSR in the whole sequence in case E, the phylogenic trees showed that the genetic distance of the 24 most closely related sequences decreased from 0.00317 to 0.002 (Figure 9). Furthermore, the greater number of similar sequences resulted in fewer branches.

Figure 8. A comparison between two alignments of the sequence of case A by using MEGA4 software (A) and the new software prepared in this paper (B).

Figure 9. Case E phylogenic tree before (A) and after (B) applying the algorithm.

The main limitation of the new algorithm lies in determining the gap position when it is applied to an imperfect SSR. According to Kruglyak [17] and Bandström [18], imperfect repeats within the SSR region reduce the occurrence of slippage, so that an imperfect SSR loses its tandem nature and the region becomes fixed because replication slippage is prevented. This is because the bases do not find their complementary bases during replication. However, the best place for the imperfect nucleotides within a compound SSR is after the slippage site (the gap) and before the sequence that follows the SSR or the next repeated unit (Figure 7).

We can deduce from the last examples that (1) the new algorithm could be a powerful tool for compound SSRs, but less so for simple SSRs, (2) it increases the similarity between sequences during alignment by minimizing the overlaps between different repeated units, and (3) it might be necessary to apply it to sequences containing long and complicated SSRs.

SSR alignment tool (SALT)

SALT is a new tool for aligning SSR loci using the new algorithm. It was written in the PERL programming language. Figure 10 shows the main window of the program, which consists of five textboxes. The first two hold the names or directories of the input and output files. In the third textbox, the user specifies the tandem repeats, putting a space character between each repeated unit and the next one. The remaining two textboxes identify the first and last nucleotide positions of the SSR locus in the whole sequence. There are also four buttons: two for browsing the input and output files, a third for making the alignment, and the last for closing the program.

Figure 10. The window of Ssr ALignment Tool (SALT).

The input file should contain aligned sequences in fasta format or in .txt format:

  1. The first line contains the number of samples, followed by any kind of separator (space, tab, etc.) and then the number of nucleotides.

  2. Each subsequent line contains the name of an allele, followed by any kind of separator and then the sequence; a line break starts the next allele.

See Additional file 2: SALT.rar, a compressed file containing the program and the sample data used in this research.

Conclusions

SALT is a new tool that overcomes limitations in aligning SSR loci, based on the new shifting algorithm proposed in this paper. The tool is particularly useful for compound or imperfect SSRs, whose repeated units overlap extensively when they are aligned using the usual methods. The newly developed tool gives a better alignment estimate for such regions.

Materials and methods

Five different sequences of SSR motifs (Table 1), obtained from a biotechnology laboratory (Genetic Resources Section, ICARDA), were used in this research. These sequences were obtained from 26 plants representing 26 alleles. The sequences were aligned using the ClustalW algorithm implemented in MEGA 4 with the following default settings: gap opening penalty 15, gap extension penalty 6.66, IUB weight matrix, transition weight 0.5, and delay divergent cut-off 30 [23]. The same software was used to draw the phylogenic tree with the UPGMA method. The PERL programming language was used to implement the new algorithm for SSR alignment [24]. The Tk package was used to build the graphical interface [25].

Table 1. Five microsatellite motifs that vary in type and length, representing most SSR types in genome sequences