Introduction

Multiple sequence alignment (MSA) is a basic task in bioinformatics and has many applications in biological analyses such as phylogenetic inference and protein 3D structure prediction. The progressive alignment method [1] is one of the most commonly used methods for multiple sequence alignment. Roughly speaking, the method first constructs a guide tree that is supposed to capture the phylogenetic relationship of the input sequences, and then aligns the sequences progressively according to the topology of the guide tree, so that more closely related sequences are aligned first and less closely related ones are aligned later.

Recently, we proposed an adaptive approach for progressive multiple sequence alignment [2]. We observed that the alignments of sequence families with different similarities usually have different characteristics and structural properties, and that by using a reliable measure to estimate the similarity of the inputs, we may exploit the corresponding properties to help generate better alignments. To estimate the similarity, we proposed to use the average percent identity, which is defined as follows. For any two sequences, their percent identity is defined to be

$$\mathrm{PID} = \frac{N_{\mathrm{Identity}}}{L_{\mathrm{Alignment}}}$$

where $N_{\mathrm{Identity}}$ is the number of identities in the optimal pairwise alignment of the two sequences, and $L_{\mathrm{Alignment}}$ is the length of this alignment. The average percent identity $\overline{\mathrm{PID}}$ of the input sequences is the average of the PIDs over every pair of the sequences. In [2], we noted that if $\overline{\mathrm{PID}}$ is greater than 40%, the input sequences are very similar, and we showed how to exploit the properties of similar sequences and align the sequences globally. If $\overline{\mathrm{PID}}$ is between 25% and 40%, the input sequences are moderately similar, and we can exploit the corresponding properties to align them locally. For inputs with $\overline{\mathrm{PID}}$ below 25%, we do not know which alignment method is better; hence we suggested trying different methods (e.g., global alignment methods as well as local alignment methods) and using their consensus to determine the final alignment.
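The definition of percent identity and the three similarity ranges can be illustrated with a minimal Python sketch. It is not the GLProbs implementation: the function names and the strategy labels are hypothetical, the optimal pairwise alignments are assumed to have been computed elsewhere and passed in as pairs of gapped strings, and the handling of the exact boundary values (25% and 40%) is an assumption.

```python
def percent_identity(a: str, b: str, gap: str = "-") -> float:
    """PID of one pairwise alignment: identical columns / alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("alignment is empty")
    # A column counts as an identity only if both residues match and are not gaps.
    identities = sum(1 for x, y in zip(a, b) if x == y and x != gap)
    return identities / len(a)


def average_pid(pairwise_alignments) -> float:
    """Mean PID over all given pairwise alignments (one per sequence pair)."""
    pids = [percent_identity(a, b) for a, b in pairwise_alignments]
    if not pids:
        raise ValueError("at least one pairwise alignment is required")
    return sum(pids) / len(pids)


def choose_strategy(avg_pid: float) -> str:
    """Map the average PID (as a fraction) to one of the three regimes.

    Assumption: exactly 40% and exactly 25% are treated as 'moderately similar'.
    """
    if avg_pid > 0.40:
        return "global"      # very similar inputs: align globally
    if avg_pid >= 0.25:
        return "local"       # moderately similar inputs: align locally
    return "consensus"       # low similarity: combine global and local results
```

For example, `choose_strategy(average_pid(pairs))` returns the label of the regime for a list `pairs` of aligned sequence pairs.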

To test the effectiveness of our idea, we developed a software tool called GLProbs, which implements our adaptive approach for multiple sequence alignment. We carried out extensive tests and empirical comparisons of GLProbs, and the results showed that GLProbs has significantly better accuracy than a dozen other leading MSA tools (see [2] for more details).

In this paper, we study why GLProbs can achieve such high accuracy, and explore ways to further improve the software tool. In particular, we are interested in finding out the impact of the adaptive guide tree construction method used in GLProbs. This also leads us to study the following fundamental question:

Are guide trees really important for obtaining high-quality multiple sequence alignments, and if so, how should the best guide trees be constructed?

We note that there are already studies suggesting that guide trees are important. For example, Penn et al. [3] showed that uncertainties in the guide tree are a major source of alignment uncertainty, and Capella-Gutierrez and Gabaldon [4] showed that most gaps are inserted in patterns that follow the guide tree.

To study the guide trees of GLProbs, we have done the following tests.

First, we modified GLProbs to GLProbs-Random in which the adaptive guide tree construction step of GLProbs was replaced by a step that just generates a random guide tree. Then we compared the performance of GLProbs and GLProbs-Random empirically.
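The text does not specify how the random guide trees were drawn, so the following Python sketch shows only one plausible way to do it: repeatedly join two randomly chosen subtrees until a single rooted binary tree remains, then read off the order in which a progressive aligner would merge the groups. The function names and the nested-tuple tree representation are hypothetical and are not taken from GLProbs.

```python
import random


def random_guide_tree(names, seed=None):
    """Build a random rooted binary tree over the sequence names.

    Leaves are names (strings); internal nodes are (left, right) tuples.
    """
    rng = random.Random(seed)
    forest = list(names)
    if not forest:
        raise ValueError("at least one sequence name is required")
    while len(forest) > 1:
        # Pick two distinct subtrees uniformly at random and join them.
        i, j = rng.sample(range(len(forest)), 2)
        joined = (forest[i], forest[j])
        forest = [t for k, t in enumerate(forest) if k not in (i, j)]
        forest.append(joined)
    return forest[0]


def leaves(tree):
    """Return the leaf names below a node, left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def merge_order(tree):
    """Post-order list of merges: each entry pairs the two groups to be aligned."""
    if not isinstance(tree, tuple):
        return []
    left, right = tree
    return merge_order(left) + merge_order(right) + [(leaves(left), leaves(right))]
```

A tree over n sequences yields n - 1 merges, and the last merge always joins the two groups below the root.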

Second, we modified GLProbs into a new tool, GLProbs-Reference, and compared the performance of the two tools in aligning families of protein sequences whose correct multiple sequence alignments are generally agreed upon by biologists. In GLProbs-Reference, the guide tree generated by GLProbs is replaced by a phylogenetic tree constructed as follows: based on the known correct alignment of the input sequences, we construct their phylogenetic tree using the maximum-likelihood method [5], and then use this phylogenetic tree as the guide tree. Intuitively, these phylogenetic trees should be the best guide trees for the alignments. The aim of this test is to find out whether the guide trees constructed by the adaptive method are competitive with the best ones.

Finally, we study whether the adaptive guide tree construction method of GLProbs can bring similar improvements to other MSA tools. We modified five leading multiple sequence alignment tools, namely MSAProbs [6], Probalign [7], ProbCons [8], T-Coffee [9] and ClustalW [10], by replacing their original guide tree construction steps with the adaptive guide tree construction step and keeping the other steps intact. We then compared their performance in aligning protein sequence families obtained from three popular benchmark datasets.

We will detail the results of our tests in Sections 2, 3 and 4. Below, we summarise our conclusions.

• For sequences with high similarity, the guide tree construction method is not critical; many reasonable methods can generate good enough guide trees leading to satisfactory alignments.

• For sequences with moderate similarity, better guide trees are very important for generating good alignments. Our study showed that the guide trees constructed by the adaptive method of GLProbs are usually among the best, and they can be used to improve the performance of other MSA tools.

• For sequences with very low similarity, the adaptive guide tree construction method can also improve the accuracy of other MSA tools; in fact, the improvements are larger than those obtained for more similar sequences. However, the accuracy of these alignments is still very low. We found that for sequences with very low similarity, it is very difficult to generate good guide trees, and using a bad guide tree has a serious detrimental effect on the quality of the resulting alignment. For these sequences, we suggest using other methods, such as the non-progressive alignment method, that do not rely on guide trees for generating better alignments.

Comparing adaptive guide trees with random trees

As mentioned above, the progressive multiple sequence alignment method needs to construct a guide tree to determine the order of the progressive alignments. Intuitively, the accuracy of the alignments depends heavily on the quality of the guide trees; if the sequences are aligned in the wrong order, the accuracy may be low.

To confirm this intuition, we modified GLProbs into GLProbs-Random, which replaces the guide tree constructed in GLProbs by a random guide tree. We used both tools to align protein sequence families obtained from the benchmark database OXBench. Figure 1 shows the sum-of-pairs (SP) scores and total column (TC) scores of their alignments; these are two of the most commonly used scores for measuring the quality of an MSA. Each dot (x, y) in the figure shows the scores obtained for one test sample, where x is the score obtained by GLProbs and y is the score obtained by GLProbs-Random. Unsurprisingly, most points lie below the diagonals, which means that GLProbs outperformed GLProbs-Random on these inputs. This confirms the importance of guide trees. However, it is interesting to observe that there are also many points above the diagonals, which means that for these inputs, random guide trees are better than the guide trees elaborately generated by GLProbs. After a careful study of the inputs, we found that most of these inputs have low similarity. We believe that to generate better alignments for these inputs, we should abandon the progressive method and try other methods, such as the non-progressive alignment method [11], that do not rely on guide trees to generate their alignments.

Figure 1

GLProbs vs GLProbs-Random on OXBench in terms of SP and TC scores.
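For readers unfamiliar with the two scores, the following Python sketch computes them under their commonly used definitions: SP is the fraction of residue pairs aligned in the reference alignment that are also aligned in the test alignment, and TC is the fraction of reference columns that are reproduced exactly in the test alignment. This is an illustration only. The function names are hypothetical, both alignments are assumed to list the same sequences in the same order as gapped strings, and benchmark scoring programs may differ in details such as which reference columns are scored (here, an assumption, those containing at least two residues).

```python
from itertools import combinations


def columns(alignment, gap="-"):
    """Turn each column into a tuple of residue indices (None marks a gap)."""
    lengths = {len(row) for row in alignment}
    if len(lengths) != 1:
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # next residue index for each sequence
    cols = []
    for c in range(lengths.pop()):
        col = []
        for r, row in enumerate(alignment):
            if row[c] == gap:
                col.append(None)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def aligned_pairs(cols):
    """Set of (seq_i, residue_i, seq_j, residue_j) pairs placed in one column."""
    pairs = set()
    for col in cols:
        for (i, a), (j, b) in combinations(enumerate(col), 2):
            if a is not None and b is not None:
                pairs.add((i, a, j, b))
    return pairs


def sp_score(test, reference):
    """Fraction of reference residue pairs that the test alignment recovers."""
    ref_pairs = aligned_pairs(columns(reference))
    if not ref_pairs:
        raise ValueError("reference alignment has no aligned residue pairs")
    return len(ref_pairs & aligned_pairs(columns(test))) / len(ref_pairs)


def tc_score(test, reference):
    """Fraction of scored reference columns reproduced exactly in the test."""
    ref_cols = [c for c in columns(reference)
                if sum(x is not None for x in c) >= 2]
    if not ref_cols:
        raise ValueError("reference alignment has no scorable columns")
    test_cols = set(columns(test))
    return sum(c in test_cols for c in ref_cols) / len(ref_cols)
```

Both functions return a value between 0 and 1, and a test alignment identical to the reference scores 1 on both measures.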

Using adaptive guide trees to improve other leading MSA tools

To study whether the adaptive guide tree construction method of GLProbs can improve the accuracy of other MSA tools, we applied it to five leading multiple sequence alignment tools: MSAProbs [6], Probalign [7], ProbCons [8], T-Coffee [9] and ClustalW [10]. We modified these tools so that they use the adaptive guide trees constructed by GLProbs. We note that these five tools have their own special features. ClustalW is among the first tools to use progressive alignment, and has been one of the most popular MSA tools since its release in 1994. MSAProbs, Probalign and ProbCons apply the consistency-based method to improve the accuracy of the progressive alignment. T-Coffee provides a simple and flexible means of producing multiple sequence alignments by using heterogeneous data sources given by a library of global and local pairwise alignments.

In our tests, we used samples from three popular benchmark datasets, namely BAliBASE [12], OXBench [13], and SABmark [14]. In particular, we used the two subsets RV11 and RV12 of BAliBASE, where RV11 contains distant sequences with less than 20% identity and RV12 consists of medium to divergent sequences with identities between 20% and 40%. For SABmark, we used its two subsets, Twilight Zone and Superfamily. Twilight Zone represents different SCOP fold subsets, each of which contains sequences with no more than 25% identity. Superfamily contains different SCOP superfamilies, whose sequences have no more than 50% identity. For OXBench, the sequence families we used range from 0% to 100% similarity.

Table 1 shows the average SP and TC scores obtained by the original alignment tools (listed in the columns labeled "Original") and those obtained by the modified tools, in which the guide trees are replaced by the ones constructed by the adaptive method of GLProbs (listed in the columns labeled "Adaptive"). Note that for all five aligners and on all three benchmarks, the average SP scores obtained with the adaptive guide trees are higher than those obtained with the original guide trees. For TC scores, using the guide trees generated by the adaptive method improves the average score in most cases.

Table 1 Mean SP and TC scores on BAliBASE, OXBench and SABmark

We also divided OXBench's input families into four categories according to their similarities. For example, the "0%-20%" category contains the families whose similarities are between 0% and 20%, and the "0%-100%" category contains all of the families. Table 2 shows that the adaptive method improves most aligners in most cases, especially in the low-similarity categories.

Table 2 Mean SP and TC scores on OXBench

Tables 3 and 4 show the results on BAliBASE and SABmark, respectively, in which the results are divided into two categories according to the similarity of the input families. All of the aligners improve in the low-similarity "0%-30%" category, except for the average TC score of T-Coffee's alignments on SABmark.

Table 3 Mean SP and TC scores on BAliBASE
Table 4 Mean SP and TC scores on SABmark

Comparing adaptive guide trees with reference guide trees

To compare the adaptive guide trees with the best ones, we modified GLProbs into GLProbs-Reference, in which the guide tree generated by GLProbs is replaced by the phylogenetic tree constructed by applying the maximum-likelihood method [5] to the correct MSA of the input sequences. Figure 2 compares the SP and TC scores of the alignments constructed by GLProbs and GLProbs-Reference for the sequence families obtained from the three benchmark databases BAliBASE, OXBench and SABmark. The figure shows that most points are located around the diagonal, which suggests that the reference (best) guide trees and the guide trees generated by the adaptive scheme lead to similar performance.

Figure 2

GLProbs vs GLProbs-Reference on BAliBASE, OXBench and SABmark. Dots above the diagonal represent cases in which GLProbs outperformed GLProbs-Reference.
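The scatter-plot comparisons against the diagonal can also be summarised numerically. The Python sketch below counts how many (x, y) score pairs fall below, above, or close to the diagonal y = x. The function name and the tolerance parameter `tol` are hypothetical, and which side of the diagonal favours which tool depends on the tool assigned to each axis in the figure being summarised.

```python
def diagonal_summary(points, tol=0.0):
    """Count score pairs below, above, and within tol of the diagonal y = x."""
    below = sum(1 for x, y in points if x - y > tol)   # x-axis tool scored higher
    above = sum(1 for x, y in points if y - x > tol)   # y-axis tool scored higher
    near = len(points) - below - above                 # difference at most tol
    return {"below": below, "near": near, "above": above}
```

With a small positive `tol`, the "near" count gives a rough measure of how often two guide tree choices lead to practically the same score.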