Background

Aligning multiple evolutionarily related protein sequences is a fundamental technique for studying protein function, structure, and evolution. Multiple sequence alignment methods are often an essential component for solving challenging bioinformatics problems such as protein function prediction, protein homology identification, protein structure prediction, protein interaction study, mutagenesis analysis, and phylogenetic tree construction. During the last thirty years or so, a number of methods and tools have been developed for multiple sequence alignment, which have made fundamental contributions to the development of the bioinformatics field.

State of the art multiple sequence alignment methods adapt some popular techniques to improve alignment accuracy, such as iterative alignment [1], progressive alignment [2], alignment based on profile hidden Markov models [3], and posterior alignment probability transformation [4, 5]. Some alignment methods, such as 3D-Coffee [6] and PROMALS3D [7], use 3D structure information to improve multiple sequence alignment, which cannot be applied to the majority of protein sequences without tertiary structures. In order to overcome this problem, we have developed a method to incorporate secondary structure, relative solvent accessibility, and contact map information predicted from protein sequences into multiple sequence alignment. Predicted secondary structure information has been used to improve pairwise sequence alignment [8, 9], but few attempts had been made to use predicted secondary structure information in multiple sequence alignment [1015]. To the best of our knowledge, applying predicted relative solvent accessibility and residue-residue contact map to multiple sequence alignment is novel.

In order to use the predicted structural information to advance the state of the art of multiple sequence alignment, we first compared the existing multiple sequence alignment tools [1631, 4, 5, 3237] on the standard benchmark data sets such as BAliBASE [38], SABmark [39] and OXBENCH [40], which showed that MAFFT [30], T-coffee [31], MSAProbs [4], and ProbCons [5] yielded the best performance. Then we developed MSACompro, a new multiple sequence alignment method, which effectively utilizes predicted secondary structure, relative solvent accessibility, and residue-residue contact map together with posterior alignment probabilities produced by both pair hidden Markov models and partition function as in MSAProbs [4]. The assessment results of MSACompro compared to the benchmark data sets from BAliBASE, SABmark and OXBENCH showed that incorporating predicted structural information has improved the accuracy of multiple sequence alignment over most existing tools without using structural features and sometimes the improvement is substantial.

Method

Following the general scheme in MSAProbs [4], MSACompro has five main steps: (1) compute the pairwise posterior alignment probability matrices based on both pair-HMM and partition function, considering the similarity in amino acids, secondary structure, and relative solvent accessibility; (2) generate the pairwise distance matrix from both the pairwise posterior probability matrices constructed in the first step and the pairwise contact map similarity matrices; (3) construct a guide tree based on pairwise distance matrix, and calculate sequence weights; (4) transform all the pairwise posterior matrices by a weighting scheme; (5) perform a progressive alignment by computing the profile-profile alignment from the probability matrices of all sequence pairs, and then an iterative alignment to refine the results from progressive alignment. Our method is different from MSAProbs in that it adds secondary structure and solvent accessibility information to the calculation of the posterior residue-residue alignment probabilities and computes the pairwise distance matrix with the help of predicted residue-residue contact information.

Construction of pairwise posterior probability matrices based on amino acid sequence, secondary structure and solvent accessibility information

For two protein sequences X and Y in a sequence group S to be aligned, we denote X = (x1, x2,......,x n 1), Y = (y1, y2,......,y n 2), where x1, x2,......, x n 1 and y1, y2,......,y n 2 are lists of the residues in X and Y, respectively. n1 is the length of sequence X, and n2 is the length of sequence Y. Suppose x i is the i-th amino acid in sequence X, and y j is the j-th amino acid in sequence Y. We let aln denote a global alignment between X and Y, ALN the set of of all the possible global alignments of X and Y, and aln* ∈ ALN the true pairwise alignment of X and Y. The posterior probability that the i-th residue in X (x i ) is aligned to the j-th residue (y j ) in Y in aln* is defined as:

p ( x i ~ y j a l n * | X , Y ) = a l n A L N P ( a l n | X , Y ) I { x i ~ y j a l n }
(1)
( 1 x i n 1 , 1 y j n 2 )
I { x i ~ y j a l n } = { 1 , i f ( x i ~ y j a l n ) t r u e o , o t h e r w i s e

P(aln | X, Y) denotes the probability that aln is the true alignment aln*: Thus, the posterior probability n1 × n2 matrix P XY is a collection of all the values p(x i ~ y j aln* | X, Y) (p(x i ~ y j ) for short) for 1 ≤ x i n1, 1 ≤ y j n 2. The calculation process of the pairwise posterior probability matrix is described as follows.

As in MSAProbs, two different methods (a pair hidden Markov model and a partition function) are used to compute the pairwise posterior probability matrices ( P X Y 1 and P X Y 2 ), respectively. The first kind of pairwise probability matrix P X Y 1 is calculated by a partition function (F) of alignments based on dynamic programming. F(i, j) denotes the probability of all partial global alignments of X and Y ending at position (i, j). F M (i, j) is the probability of all partial global alignments with x i aligned to y j , F y (i, j), is the probability of all partial global alignments with y j aligned to a gap, and F X (i, j) is the probability of all partial global alignments with x i aligned to a gap. Accordingly, the partition function can be calculated recursively as follows:

F M ( i , j ) =F ( i - 1 , j - 1 ) e W 1 β s ( x i , y j ) + W 2 S S ( s s ( x i ) , s s ( y j ) ) + W 3 S A ( s a ( x i ) , s a ( y j ) )
F Y ( i , j ) = F M ( i , j - 1 ) e β g a p + F Y ( i , j - 1 ) e β e x t
(2)
F X ( i , j ) = F M ( i - 1 , j ) e β g a p + F X ( i - 1 , j ) e β e x t
F ( i , j ) = F M ( i , j ) + F Y ( i , j ) + F X ( i , j )

Subject to the constraint W1 + W2 + W3 = 1.

In the formula above, s(x i , y j ) is the amino acid similarity score between x i and y j . One element of the substitution matrix s, SS(ss(x i ), ss(yj)) is the similarity score between the secondary structure (ss(x i )) of residue x i in protein X and that of residue y j in protein Y according to the secondary structure similarity matrix SS, SA(sa(x i ), sa(y j )) is the similarity score between the relative solvent accessibility (sa(x i )) of residue x i in protein X and that of residue y j in protein Y according to the solvent accessibility similarity matrix SA. W1, W2, W3 are weights used to control the influence of the amino acid substitution score, secondary structure similarity score, and solvent accessibility similarity score. The secondary structure and solvent accessibility can be automatically predicted by SSpro/ACCpro [41] (http://sysbio.rnet.missouri.edu/multicom_toolbox/) using a multi-threading technique implemented in MSACompro, or alternatively be provided by a user. The values of the three weights are set to 0.4, 0.5, and 0.1 by default, and can be adjusted by users. The ensembles of bidirectional recurrent neural network architectures in ACCpro are used to discriminate between two different states of relative solvent accessibility, higher or lower than the accessibility cutoff - 25% of the total surface area of a residue [42], corresponding to e or b. As in MSAprobs, β is a parameter measuring the deviation between suboptimal and optimal alignments, gap(gap ≤ 0) is the gap open penalty, and ext(ext ≤ 0) is the gap extension penalty.

We used the Gonnet 160 matrix as a substitution matrix to generate the similarity scores between two amino acids in proteins [43]. The 3 × 3 secondary structure similarity matrix SS contains the similarity scores of three kinds of secondary structures (E, H, C) as follows:

SS= 100 010 001

, where two identical secondary structures receive a score of 1 and different ones receive a score of 0.

The 2 × 2 solvent accessibility similarity matrix SA contains the similarity scores of two kinds of relative solvent accessibilities (e, b) as follows:

SA= 10 01

, where two identical solvent accessibilities receive a score of 1 and different ones receive a 0. It is worth noting that we used the simple identity scoring matrix for secondary structure and solvent accessibility here. Employing more advance scoring matrices defined in [44] may lead to further improvement. Each posterior residue-residue alignment probability element in the first kind of posterior probability matrix ( P X Y 1 ) can be calculated from the partition function as:

p 1 ( x i ~ y j ) = F M ( i 1 , j 1 ) F M ' ( i + 1 , j + 1 ) F e W 1 β s ( x i , y j ) + W 2 S S ( s s ( x i ) , s s ( y j ) ) + W 3 S A ( s a ( x i ) , s a ( y j ) )
(3)

, where F M ( i , j ) denotes the partition function of all the reverse alignments starting from the position (n1, n2) till position (i, j) with x i aligned to y j .

As in MSAProbs, the second kind of pairwise probability matrix P X Y 2 is calculated by a pair hidden Markov model (HMM) combining both Forward and Backward algorithm [4, 5, 45]. The pairwise probabilities can be generated under the guidance of pair HMM involving state emissions and transitions. P X Y 2 is only derived from protein sequences without using secondary structure and solvent accessibility, which is different from PROMALS [15] that lets HMM emit both amino acids and secondary structure alphabets.

The final posterior probability matrix P XY is calculated as the root mean square of the corresponding values in P X Y 1 and P X Y 2 as follows.

p ( x i ~ y j ) = p 1 ( x i ~ y j ) 2 + p 2 ( x i ~ y j ) 2 2
(4)

where p1(x i ~ y i ) and p2(x i ~ y i ) denote a posterior probability element in two kinds of posterior probability matrices ( P X Y 1 and P X Y 2 ), respectively.

Construction of pairwise distance matrices based on pairwise posterior probabilities and pairwise contact map scores

The posterior probability matrix P XY is used as a scoring function to generate a pairwise global alignment between sequences X and Y. The optimal global alignment score Opt(X,Y) of the global alignment is computed according to an optimal sub-alignment score matrix AS. The optimal sub-alignment score AS(i, j) denotes the score of the optimal sub-alignment ending at residues i and j in X and Y. The AS matrix is recursively calculated as:

A S ( i , j ) = max { A S ( i 1 , j 1 ) + P X Y ( x i ~ y j ) A S ( i 1 , j ) A S ( i , j 1 )
(5)

AS (n1, n2) is the optimal score of the full global alignment between X and Y, which is denoted as Optscore(X,Y).

In addition to the optimal alignment score, we introduce a contact map score, CMscore(X, Y), for the optimal pairwise alignment of X and Y, assuming that the spatially neighboring residues of two aligned residues should have a higher tendency to be aligned together. CMscore(X, Y) is calculated from the contact map correlation score matrix CMap XY based on the residue-residue contact map matrices CMap X and CMap Y of X and Y.

Assuming the optimal global alignment of X and Y is represented as,

x 1 x 2 . . . . . . . - x m . . . . . . x p . . . . . . x n 1 y 1 - . . . . . . y k y k + 1 . . . . . - . . . . . . y n 2

we can generate a new alignment after removing the pairs containing gaps:

x 1 . . . . . . . x m . . . . . . . . . . . . x n 1 y 1 . . . . . . y k + 1 . . . . . . . . . . . y n 2

, which can be denoted as

x 1 x 2 . . . . . . . . . . . . x n y 1 y 2 . . . . . . . . . . . y n

, where n is the length of the new alignment without gaps

From this alignment, we can construct two contact map matrices, CMap X and CMap Y , shown below:

CMa p X = x 11 x 12 . . . . . . x 1 n x 21 x 22 . . . . . . x 2 n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x n 1 x n 2 . . . . . . x n n
(6)
CMa p Y = y 11 y 12 . . . . . . y 1 n y 21 y 22 . . . . . . y 2 n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . y n 1 y n 2 . . . . . . y n n
x i j

is the contact probability score between amino acid x i and x j in protein sequence X, and y i j is the contact probability score between amino acid y i ' and y j ' in protein sequence Y. The residue-residue contact probabilities are predicted from the sequence by NNcon [46] (http://sysbio.rnet.missouri.edu/multicom_toolbox/). The contact map correlation score matrix CMap XY is calculated as the multiplication of CMap X and CMap Y :

C M a p X Y = C M a p X × C M a p Y = x y 11 x y 12 . . . . x y 1 n x y 21 x y 22 . . . . x y 2 n . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x y n 1 x y n 2 . . . . x y n n
(7)
x y i i

is the contact map score for an aligned residue pair (amino acid x i in protein X and amino acid y i ' in protein Y). The contact map score for the global alignment of two sequences X and Y is calculated as

C M s c o r e ( X , Y ) = 1 n 2 i = 1 n C M a p X Y ( i , i ) = 1 n 2 i = 1 n x y i i = 1 n 2 i = 1 n j = 1 n x i j y j i
(8)

In practice, we only need to calculate the diagonal values in CMap XY .

Finally, we define the pairwise distance between sequences X and Y as

d ( X , Y ) =1- W 4 O p t s c o r e ( X , Y ) min { n 1 , n 2 } - W 5 CMscore ( X , Y )
(9)

, where W4 + W5 = 1. The weights W4 and W5 are used to control the influence of sequences X and Y.

Construction of guide tree and transformation of posterior probability

Akin to MSAProbs [4], a guide tree is constructed by the UPGMA method that uses the linear combinatorial strategy [47]. The distance between a new cluster Z formed by merging clusters X and Y, and another cluster W is calculated as (10):

d ( W , Z ) = d ( W , X ) × N u m ( X ) + d ( W , Y ) × N u m ( Y ) N u m ( X ) + N u m ( Y )
(10)

In which Num(X) is the number of leafs in cluster X.

After the guide tree is constructed, sequences are weighted according to the schemes inferred in [4].

To reduce the bias of sampling similar sequences, we use a weighted scheme to transform the former posterior probability as

P X Y = 1 w N ( ( w X + w Y ) P X Y + Z S , Z X , Y w z P X Z P Z Y )
(11)

w X and w Y are, respectively, the weight of sequences X and Y, w Z is the weight of a sequence Z other than X or Y in the given group of sequences, and wN is the sum of sequence weights in dataset S.

Combination of progressive and iterative alignment

We first use the guide tree to generate a multiple sequence alignment by progressively aligning two clusters of the most similar sequences together. As in MSAProbs [4], we also apply a weighted profile-profile alignment to align two clusters of sequences. The sequence weights are the same as in the previous step. The posterior alignment probability matrix of two clusters/profiles is averaged from the probability matrices of all sequence pairs (X, Y), where x and y are from the two different clusters. Formula (5) used to generate the global profile-profile alignment is based on the posterior alignment probability matrices of the profiles. In order to further improve the alignment accuracy, we then use a randomized iterative alignment to refine the initial alignment. This randomized iterative refinement randomly partitions the given sequence group S into two separate groups, and performs a profile-profile alignment of the two groups. The iterative refinement can be completed after 10 iterations by default, or a fixed number of iterations set by users. Generally speaking, the final progressive alignment orders sequences along the guide tree from closely related to distantly related. To improve the alignment accuracy, a final iterative alignment is applied to refine the results from progressive alignment. In addition, a multi-thread technology based on OpenMP is also used to improve the efficiency of the program [48].

Results and discussion

Evaluation of MSACompro and other tools on the standard benchmarks

We tested MSACompro in comparison to three benchmarks: BAliBASE, SABmark and OXBENCH, and evaluated the alignment results in terms of sum-of-pairs (SP) score and true column (TC) score. The SP score is the number of correctly aligned pairs of residue in the test alignment divided by the total number of aligned pairs of residues in core blocks of the reference alignment [49]. The TC score is the number of correctly aligned columns in the test alignment divided by the total number of aligned columns in core blocks of the reference alignment [49]. We used the application bali_score provided by BAliBASE 3.0 to calculate these scores. We compared MSACompro to 11 other MSA tools which do not have access to the structural information, including ClustalW 2.0.12, DIALIGN-TX 1.0.2 [27], FSA 1.15.5, MAFFT 6.818, MSAProbs 0.9.4, MUSCLE 3.8.31, Opal 0.2.0, POA 2, Probalign 1.3, Probcons and T-coffee 8.93. It is worth noting that a fair comparison between our method with these multiple sequence alignment methods without using structural features is not possible because these methods use less input information. So, the goal of comparison is to present the idea that structural information-based alignment may contain valuable information that is not available in sequence-based multiple sequence alignments and can therefore be a supplement to sequence-based alignments. And to make the evaluation more fair and comprehensive, we also compared MSACompro with four tools which use structural information, including MUMMALS 1.01 [14], PROMALS [15] and PROMALS3D [7].

To understand how various parameters of MSACompro affect alignment accuracy, some experiments were carried out to evaluate these variants based on two algorithm changes: (1) combining amino acids, secondary structure, and relative solvent accessibility information into the partition function calculation using respective weights for each of them; (2) computing the pairwise distance from both the pairwise posterior probability matrices and the pairwise contact map similarity matrices by introducing the weight wc for contact map information. To optimize the parameters, we used BAliBASE 3.0 data sets as training sets, and SABmark 1.65 and OXBENCH data sets as testing sets. Firstly, we focused on the effect of secondary structure and solvent accessibility information by testing different values of weight w1 for amino acid similarity and weight w2 for secondary structure information on BAliBASE 3.0 data sets. MSACompro worked wholly the best if the weight w1 for amino acid similarity and the weight w2 for secondary structure information were 0.4 and 0.5, respectively. Since the sum of w1, w2 and wc is 1, we can deduce that wc is 0.1 if w1 and w2 are 0.4 and 0.5. Then we focused on the effect of residue-residue contact map information under two different scenarios: using secondary structure and relevant solvent accessibility information by keeping the w1, w2, and w3 at their optimum values (0.4, 0.5, 0.1), or excluding that information by setting both w2 and w3 as 0. Evaluation results on BAliBASE 3.0 database were found to improve the most when wc is 0.9 by integrating both secondary structure and relevant solvent accessibility information. Additionally, to avoid over-fitting, we tested MSACompro against SABmark 1.65 and OXBENCH data sets using this set of parameters independently, and found that a significant improvement was also gained in comparison to other leading protein multiple sequence alignment tools. More details can be found in the next section, "A comprehensive study on the effect of predicted structural information on the alignment accuracy". Consequently, the weights w1, w2, w3 and wc are respectively set at 0.4, 0.5, 0.1 and 0.9 in MSACompro by default. All other tools were also evaluated under default parameters.

Firstly, we evaluated these methods on BAliBASE [16] - the most widely used multiple sequence alignment benchmark. The latest version, BAliBASE 3.0, contains 218 reference alignments, which are distributed into five reference sets. Reference set 1 is a set of equal-distant sequences, which are organized into two reference subsets, RV11 and RV12. RV11 contains sequences sharing >20% identity and RV12 contains sequences sharing 20% to 40% identity. Reference set 2 contains families with >40% identity and a significantly divergent orphan sequence that shares <20% identity with the rest of the family members. Reference set 3 contains families with >40% identity that share <20% identity between each two different sub-families. Reference set 4 is a set of sequences with large N/C-terminal extensions. Reference set 5 is a set of sequences with large internal insertions. Tables 1, 2, and 3 report the mean SP scores and TC scores of MSACompro and the tools without using structural information for the six subsets and the whole database. All the scores in the tables are multiplied by 100, and the highest scores in each column are marked in bold. The results show that MSACompro received the highest SP and TC scores on the whole database and all the subsets except for the SP score for the subset RV40. In some cases, MSACompro's improvement was substantial.

Table 1 Total SP scores on the full-length BAliBASE 3.0 subsets.
Table 2 Total TC scores on the full-length BAliBASE 3.0 subsets.
Table 3 Overall mean SP and TC scores on the full-length BAliBASE 3.0 subsets.

Secondly, we evaluated MSACompro and other tools without the help of structural information on the SABmark database [4], which is a very challenging data set for multiple sequence alignment according to a comprehensive study [50]. SABmark is an automatically generated data set consisting of two sets. One set is from SOFI [51] and the other is from the ASTRAL database [52], which contains remote homologous sequences in twilight-zone or superfamily. Since some pairwise reference alignments in SABmark are not generally consistent with multiple alignments, a subset of SABmark, 1.65 called SABRE [53], has been widely used as a multiple sequence alignment benchmark database. SABRE was constructed by identifying mutually consistent columns (MCCs) in the pairwise reference structure alignment. MCCs are considered similar to BAliBASE core blocks. SABRE contains 423 out of 634 SABmark groups that have eight or more MCCs. Table 4 shows the overall mean SP and TC scores of the alignments. The mean SP and TC scores of MSACompro are 8.3 and 9.1 points higher than those of the second best-performer, MSAProbs, demonstrating that incorporating predicted structural features into multiple sequence alignments can substantially improve alignment accuracy for even remotely related homologous sequences. Figure 1 shows an example comparison between the alignments generated by our method, MSACompro, and MSAProbs from the SABRE database. The SP and TC scores significantly improved from 0.307 to 0.853 and 0 to 0.780, respectively. This case demonstrates that taking predicted structural information can help avert aligning unmatched regions, especially when the sequence similarity is unrecognizable.

Table 4 Overall mean SP and TC scores on the SABmark 1.65.
Figure 1
figure 1

an example in SABRE database comparing the alignments generated by our method and MSAProbs. The reference alignment and resulting alignments generated by both methods are respectively shown in the figure. The correct alignment regions significantly improved by our MSACompro after taking structural information are marked in red rectangles. In contrast, the corresponding incorrect alignment regions generated by MSAProbs are represented in green rectangles. The predicted secondary structure and solvent accessibility information for the correctly aligned regions are shown in circles.

Thirdly, we also assessed all the tools without using the structural information on the OXBENCH database [54]. OXBENCH is also a popular benchmark database generated by the AMPS multiple alignment method from the 3Dee database of protein structural domains [55]. The conserved columns in OXBENCH can be considered similar to BAliBASE core blocks. The mean SP and TC scores over the whole database are shown in Table 5. The results show that MSACompro improves the alignment accuracy over all other methods.

Table 5 Overall mean SP and TC scores on the OXBENCH. Bold denotes highest scores.

Finally, we also compared the SP scores and TC scores of MSACompro and other tools which adopt the structural information on the six subsets of BAliBASE database, SABmark database and OXBENCH database. Tables 6 and 7 demonstrate the SP and TC scores across the three databases. The results show that MSACompro gained the highest scores on three out of six subsets of BAliBASE and achieved the third highest scores on other data sets, which are lower than PROMALS3D that used true experimental structures as input and PROMALS that used both predicted secondary structures and additional homologous protein sequences found by PSI-BLAST search's on a large protein sequence database [15]. Overall, MSACompro performed similarly as PROMALS, whereas the latter has an advantage on a remote homologous protein sequence data set SABmark since it directly incorporates additional homologous protein sequences to improve the alignment of remotely related target sequences during the progressive alignment process. Moreover, the accuracy of MSACompro on the BAliBASE 3.0 data sets seems to be higher than the published results of another alignment tool of using secondary structure information - DIALIGN-SEC [12], which was not directly tested in our experiment because it is only available as a web server other than a downloadable software package. Therefore, MSACompro is useful to improve the accuracy of multiple sequence alignment in general and particularly for most cases in reality where experimental structures are not available.

Table 6 Total SP scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.
Table 7 Total TC scores of the tools which use the structural information on BAliBASE 3.0 subsets, SABmark data sets and OXBENCH data sets.

In order to check if alignment score differences between MSACompro and the other alignment methods are statistically significant, we carried out the Wilcoxon matched-pair signed-rank test [56] on both SP and TC scores of these methods on the three data sets. The p-values of alignment score differences calculated by the Wilcoxon matched-pair signed-rank test are reported in Table 8. Generally speaking, the alignment scores of MSACompro are significantly higher than all the alignment methods without using structural information and MUMMALS of using structural information in all but one case according to the significance threshold of 0.05. The exception is that MSACompro's TC score is higher than MSAProbs on the BAliBASE, but not statistically significant. However, the alignment scores of MSACompro are mostly statistically lower than the other two alignment methods (PROMALS or PROMALS3D) of using predicted structural features, more homologous sequences, or tertiary structures.

Table 8 The statistical significance (i.e. p-values) of SP and TC alignment score differences between MSACompro and the other tools on three benchmark data sets.

In addition to alignment accuracy, alignment speed is also a factor to consider in time-critical applications. Because it is difficult to rigorously compare the speed of different methods due to the difference in implementation and inputs, we only report the roughly estimated running time of the different methods on BAliBASE based our empirical observations. The fastest methods are ClustalW, MAFFT, MUSCLE, and POA, which used less than one hour. The medium-speed methods that used a few hours to less than one day include FSA, Opal, Probalign, MSAProbs, ProbCons, T-coffee, MUMMALS, and DIALIGN-TX. The more time demanding methods are MSACompro, PROMALS, and PROMALS3D because they need to generate extra information for alignment. We ran both PROMALS and MSACompro on the BAliBASE 3.0 database on an 4 eight-core (i.e. 32 CPU cores) Linux server to calculate their running time. It took about 4 days and 6 hours for PROMALS to run on the whole BAliBASE 3.0 data sets, and about 9 hours and 13 minutes for MSACompro to run on the same data sets. MSACompro was faster because it used a multiple-threading implementation to call SSpro/ACCpro to predict secondary structure and solvent accessibility in parallel. Out of about 9 hours and 13 minutes, about four hours and 17 minutes were used by MSACompro to align sequences if secondary structure and solvent accessibility information was provided. However, if only one CPU core is used, it took around 6 days and 14 hours for SSpro and ACCpro called by MSACompro to predict secondary structure and solvent accessibility information alone, which is time-consuming. Therefore, MSACompro will be slower than PROMALS if it runs a single CPU core, but faster on multiple (> = 3) CPU cores. As for PROMALS3D, it used about 9 days to extract tertiary structure information and make alignments.

A comprehensive study of the effect of predicted structural information on the alignment accuracy

To understand the impact of predicted secondary structure, relative solvent accessibility, and contact map on the accuracy of multiple sequence alignment, we tested their effects on alignments individually or in combination by adjusting the values of their weights used in the partition function (i.e. for secondary structure and solvent accessibility) or in the distance calculation (i.e. for contact map).

I. Effect of secondary structure information

We studied the effect of secondary structure information by adjusting the values of w1 (weight for amino acid sequence information) and w2 (weight for secondary structure information), the sum of which was kept as 1, and setting the values of w3 (weight for relative solvent accessibility) and wc (weight for contact map) to 0. The results for different w2 values on the SABmark data sets are shown in Table 9. The highest score is denoted in bold and by a superscript of star, and the second highest is denoted in bold. The results show that incorporating secondary structure information always improves alignment accuracy over the baseline established without using secondary structure information (w2 = 0). The highest accuracy is achieved when w2 is set to .5, at which point the score is 8 points greater than the baseline. w2 = 1 means that only secondary structure is used to calculate the posterior alignment probability in the partition function (i.e. equation set (2)), but amino acid sequence similarity is still used to calculate the other posterior alignment probability by the pair Hidden Markov Models. Figures 2 and 3 plot the SP and TC scores against weight values in Table 9 and Table 10, respectively.

Table 9 SP scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 2
figure 2

the 2D plot of SP scores against w 2 on the SABmark dataset.

Figure 3
figure 3

the 2D plot of TC scores against w 2 on the SABmark dataset.

Table 10 TC scores for different weights of secondary structures on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.

II. Effect of relative solvent accessibility information

Similarly, we studied the effect of relative solvent accessibility on the SABmark by adjusting the values of w1 and w3 and setting the values of w2 and wc to 0. The SP and TC scores with respect to different weight values are shown in Tables 11 and 12, respectively. The scores are also plotted against the weights in Figures 4 and 5, respectively. The highest SP and TC scores were achieved when w3 was set to 0.5 or 0.6.

Table 11 SP scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Table 12 TC scores for different weights of relative solvent accessibility on the SABmark benchmark. Bold denotes the two best scores, and an extra superscript of star denotes the highest score.
Figure 4
figure 4

the 2D plot of SP scores against w 3 on the SABmark dataset.

Figure 5
figure 5

the 2D plot of TC scores against w3 against the SABmark dataset.

III. Effect of residue-residue contact map information

We investigated the effect of contact map information on the BAliBASE 3.0 data set by adjusting wc and setting w2 and w3 to 0. We used NNcon to successfully predict the contact maps for subset RV11, RV30, 42 out of 44 alignments in RV12, 38 out of 40 in RV20, 33 out of 46 in RV40, and 14 out of 16 in RV50. We tested the MSACompro method against this data with contact predictions. Tables 13 and 14 show the SP and TC scores for different wc values on the subsets of the BAliBASE dataset. The results show that using contact information improved the alignment accuracy on some, but not all, subsets.

Table 13 SP scores for different weights for contact map on the BAliBASE3.0 database. Red color highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.
Table 14 TC scores for different weights for contact map on the BAliBASE 3.0 database. Red highlights the improved scores on each BAliBASE subset. Bold denotes the increased scores.

IV. Effect of combining secondary structure and solvent accessibility information

We adjusted the values of w1 (weight for amino acid), w2 (weight for secondary structure) and w3 (weight for relative solvent accessibility) simultaneously to investigate the effect of using secondary structure and relative solvent accessibility together. SP and TC scores on different parameter combinations are shown in Tables 15 and 15. The highest score is denoted in bold and by a superscript of 1, the second in bold and by a superscript of 2, and the third in bold and by a superscript of 3. The results show that the highest scores are achieved when w1 ranges from 0.4 to 0.5, w2 from 0.4 to 0.5, and w3 from 0.1 to 0.2. Also, using both secondary structure and solvent accessibility improves alignment accuracy over using either one. The best alignment score, which uses both secondary structure and solvent accessibility, is >8 points higher than the baseline approach, which does not use them. The changes of SP scores and TC scores with respect to the weights are visualized by the 3D plots in Figures 6 and 7. We conducted similar experiments on BAliBASE 3.0 and OXBENCH and got the similar results (data not shown).

Table 15 SP scores for different weight combinations (w1 - amino acid, w2 - secondary structure, w3 - solvent accessibility) on the SABmark 1.65 dataset.
Table 16 TC scores scores for different weight combinations (w1 - amino acid, w2 - secondary structure, w3 - solvent accessibility) on the SABmark 1.65 dataset.
Figure 6
figure 6

3D plot of SP scores against secondary structure weight w 2 and relative solvent accessibility weight w 3 .

Figure 7
figure 7

3D plot of TC scores against secondary structure weight w 2 and relative solvent accessibility weight w 3 .

V. Effect of using contact map information together with secondary structure and solvent accessibility information

In order to study whether or not contact information can be used effectively with secondary structure and solvent accessibility, we adjusted the weight wc for contact information, while keeping the w1, w2, and w3 at their optimum values (0.4, 0.5, and 0.1 respectively). Tables 17 and 18 report the SP and TC scores on the BAliBASE 3.0 data set for different wc values from no contact information (wc = 0) to maximum contact information (wc = 1). The results show that the improvement caused by contact information seems not to be substantial and significant.

Table 17 SP scores for different contact map weight wc on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.
Table 18 TC scores for different contact map weight wc on the BAliBASE3.0 database while keeping the weights for amino acid, secondary structure, solvent accessibility to 0.4, 0.5, and 0.1, respectively.

Conclusion

In this work, we designed a new method to incorporate predicted secondary structure, relative solvent accessibility, and residue-residue contact information into multiple protein sequence alignment. Our experiments on three standard benchmarks showed that the method improved multiple sequence alignment accuracy over most existing methods without using secondary structure and solvent accessibility information. However, the performance of the method is comparable to PROMALS and PROMALS3D by slightly lower scores on some subsets and behind it by a large margin on SABMARK probably because these two methods used homologous sequences or tertiary structure information in addition to secondary structure information. Since multiple sequence alignment is often a crucial step for bioinformatics analysis, this new method may help improve the solutions to many bioinformatics problems such as protein sequence analysis, protein structure prediction, protein function prediction, protein interaction analysis, protein mutagenesis and protein engineering.