Background

Transmembrane proteins (TMPs) are non-soluble proteins anchored in a cell membrane and containing one or more membrane-spanning segments separated by intra- or extra-cellular domains of variable length. The length of these segments reflects the bi-layer membrane width, though the segments can also be tilted within the membrane, thus requiring more amino acids (up to 30) to span it. TMPs constitute about 20-30% of all protein-coding genes in prokaryotic and eukaryotic organisms [1, 2]. The problem of generating multiple sequence alignments (MSAs) of TMPs was first addressed by [3], and over the last years several packages specifically designed for that task have been published [4-6]. To our knowledge PRALINE™ is the only TMP multiple aligner currently available. PRALINE™ belongs to a category of aligners using the process of homology extension. PROMALS [7] and PSI-Coffee [8] also belong to this category. Homology extension is a method that uses database searches to replace each sequence with a profile made of close homologues. As a result, each position of any sequence becomes a column in a multiple alignment, thus reflecting the pattern of acceptable mutations. These patterns are very informative, as they tend to reflect the sum of constraints (mostly functional and structural) that have shaped the diversity observed along the proteins of the same family. A natural consequence of the conservation of these patterns is the high sensitivity of profile-profile comparisons when doing remote homology search [9].

In this paper we show that the PSI-Coffee homology extension can also be used to reveal and exploit conservation patterns specific to TMPs, such as the amphiphilic properties of transmembrane alpha helices, and thus yield significant improvements when aligning TMPs. A critical question when doing homology extension is the influence of the database used to make the extension. Typical homology extension involves performing BLAST or PSI-BLAST [10] against the NR database [11]. This step is time consuming and makes homology extension prohibitively costly compared with faster methods. We show here that one can overcome this problem by using smaller non-redundant databases. We go even further by showing that a TMP-specific database can be used for homology extension at a much lower CPU cost and without any significant reduction in alignment accuracy.

Methods

Homology extension

The process of homology extension involves replacing individual sequences with a set of multiply aligned homologues. Given a dataset, this procedure involves performing BLAST for each individual sequence against a protein database and turning the resulting output into a one-against-all MSA (i.e. query against hits). These MSAs (one per sequence in the original dataset) are then turned into profiles. The purpose of homology extension is to reveal the evolutionary variability associated with each site of the considered sequences thus producing more accurate pair-wise alignments [7]. We used blast+ (version 2.2.25) against various databases (see next section). In practice, homology extension is made automatically by T-Coffee [12] either using the public web service maintained by the European Bioinformatics Institute (default) or using a locally installed BLAST against locally maintained databases.

Databases

Homology extension was carried out against two databases: NR and UniRef [13]. In order to check the effect of redundancy, we used versions of the UniRef non-redundant database trimmed at various levels of redundancy (UniRef100, UniRef90 and UniRef50). In these versions, the database is modified so that no pair of sequences has an identity higher than the specified level. UniRef100-TM, UniRef90-TM and UniRef50-TM are even smaller databases produced by filtering the corresponding UniRef dataset with the following query string: "keyword:transmembrane". These TMP-specific databases are typically 20% of the size of their sources.

TM-Coffee algorithm

TM-Coffee uses the PSI-Coffee (Position Specific Iterative T-Coffee) mode of T-Coffee to multiply align TMPs. The algorithm can be summarized as follows:

  1. Perform BLAST for each query sequence against the selected database with default parameters.

  2. Keep hits having a level of identity between 50% and 90% and a coverage higher than 70%.

  3. Turn the BLAST output into a profile by removing all columns corresponding to positions unaligned to the query (i.e. gaps in the query) and by filling with gaps query positions unmatched by BLAST.

  4. Produce a T-Coffee library by aligning every pair of profiles with a pair-HMM. When doing so, every pair of matched columns with a posterior probability of being aligned higher than 0.99 is added to the library. The pair-HMM is adapted from the ProbCons pair-HMM [14] in order to deal with profiles. It uses the ProbCons bi-phasic gap penalty set (i.e. two distinct sets of gap opening and extension penalties for short and longer gaps). The parameter values are those initially reported [14].
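
Steps 2 and 3 of the summary above can be illustrated with a minimal Python sketch. It is not the T-Coffee implementation: the `Hit` container, the function names and the exact definitions of identity (identical residues over aligned residue pairs) and coverage (query residues aligned to a hit residue, over the query length) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One pairwise BLAST hit as gapped strings (hypothetical container)."""
    query_start: int  # 0-based query position where the local alignment begins
    query_aln: str    # aligned query segment, '-' marks a gap
    hit_aln: str      # aligned hit segment, same length as query_aln


def identity_and_coverage(hit, query_len):
    """Return (percent identity, percent coverage) under the assumed definitions."""
    pairs = [(q, h) for q, h in zip(hit.query_aln, hit.hit_aln)
             if q != '-' and h != '-']
    if not pairs:
        return 0.0, 0.0
    identity = 100.0 * sum(q == h for q, h in pairs) / len(pairs)
    coverage = 100.0 * len(pairs) / query_len
    return identity, coverage


def keep_hit(hit, query_len, min_id=50.0, max_id=90.0, min_cov=70.0):
    """Step 2: identity between 50% and 90%, coverage higher than 70%."""
    identity, coverage = identity_and_coverage(hit, query_len)
    return min_id <= identity <= max_id and coverage > min_cov


def profile_row(hit, query_len):
    """Step 3: project a hit onto query coordinates."""
    row = ['-'] * query_len          # query positions unmatched by BLAST stay gaps
    pos = hit.query_start
    for q, h in zip(hit.query_aln, hit.hit_aln):
        if q == '-':
            continue                 # gap in the query: this column is removed
        row[pos] = h
        pos += 1
    return ''.join(row)


def build_profile(query, hits):
    """Query-anchored profile: the query first, then one row per retained hit."""
    kept = [h for h in hits if keep_hit(h, len(query))]
    return [query] + [profile_row(h, len(query)) for h in kept]


query = "MKTLLVAGAL"
hits = [
    Hit(1, "KTLL-VAGA", "KSLLPVAGT"),  # 75% identity, 80% coverage: kept
    Hit(0, query, query),              # 100% identity: too redundant, dropped
    Hit(0, "MKTL", "MRTI"),            # 40% coverage: too short, dropped
]
for row in build_profile(query, hits):
    print(row)
# MKTLLVAGAL
# -KSLLVAGT-
```

Every row of the resulting profile has the length of the query, so each query position can be read as one alignment column.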

Benchmarking

We used reference 7 of BAliBASE2 [15] as a gold standard. This dataset is made of 435 alpha-helical TMPs classified into eight distinct families that can be multiply aligned. The core regions of BAliBASE, as defined by its authors, cover the alignment of structurally equivalent residues only (Additional file 1). Evaluation is made by assessing the capacity of the methods to recapitulate these core regions, which are mostly made of alpha helices. Two metrics are used to assess accuracy. The Sum of Pairs score (SP) estimates the fraction of residue pairs from the reference core that are identically aligned in the target and the reference MSA. The Total Column score (TC) estimates the fraction of columns identically aligned in the target and the reference.
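
The two metrics can be made concrete with a small Python sketch. This is an illustrative reimplementation under simplifying assumptions, not the BAliBASE scoring program: alignments are lists of gapped strings given in the same sequence order, `core` is a hypothetical optional list of reference column indices, and a reference column counts for TC only if the test alignment contains a column with exactly the same residues and gaps.

```python
from itertools import combinations


def residue_columns(msa):
    """Describe each column by the residue index of every sequence (None for a gap)."""
    counters = [0] * len(msa)
    columns = []
    for col in range(len(msa[0])):
        entry = []
        for s, seq in enumerate(msa):
            if seq[col] == '-':
                entry.append(None)
            else:
                entry.append(counters[s])
                counters[s] += 1
        columns.append(tuple(entry))
    return columns


def aligned_pairs(columns):
    """Set of residue pairs (seq a, index, seq b, index) placed in the same column."""
    pairs = set()
    for col in columns:
        for a, b in combinations(range(len(col)), 2):
            if col[a] is not None and col[b] is not None:
                pairs.add((a, col[a], b, col[b]))
    return pairs


def sp_tc(reference, test, core=None):
    """Return (SP, TC) of `test` measured against `reference`.

    Assumes the reference contains at least one aligned residue pair.
    """
    ref_cols = residue_columns(reference)
    if core is not None:
        ref_cols = [ref_cols[i] for i in core]  # restrict scoring to core columns
    test_cols = residue_columns(test)
    ref_pairs = aligned_pairs(ref_cols)
    sp = len(ref_pairs & aligned_pairs(test_cols)) / len(ref_pairs)
    test_set = set(test_cols)
    tc = sum(col in test_set for col in ref_cols) / len(ref_cols)
    return sp, tc


reference = ["ACDE", "ACDE", "A-DE"]
test = ["ACDE", "ACDE", "AD-E"]   # one residue of the third sequence is misplaced
print(sp_tc(reference, test))      # (0.8, 0.5)
```

In this toy case a single misplaced residue leaves 80% of the reference pairs intact but only half of the columns, which is why TC is the more stringent of the two measures.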

Aligners

We used the BAliBASE-ref7 dataset to compare PSI-Coffee with the six most accurate methods currently available: MSAProbs 0.9.4 [16], Kalign 2.04 [17], PROMALS, MAFFT 6.815 [18], ProbCons 1.12 and PRALINE™. All methods were run using default parameters. PSI-Coffee is part of the T-Coffee suite (Version 8.99). It was run with default parameters except for the database used for homology extension. This was done with the following command line:

t_coffee <seq.fasta> -mode psicoffee -blast_server LOCAL -protein_db <database> -template_file PSITM

The PSITM template file mode is used here to display a coloured MSA version (.tm_html output file, Figure 1) reflecting predictions carried out by HMMTOP [19] using the profile associated with each sequence. This prediction is only used for display purposes and is not required by the alignment procedure.
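
When the same dataset has to be realigned against several databases, the command line above can be assembled programmatically. The following Python sketch only builds and prints the command; the helper name `psicoffee_command`, the file name `seqs.fasta` and the database paths are hypothetical placeholders, and the flags are exactly those of the command line given above.

```python
import shlex


def psicoffee_command(fasta_path, database, template="PSITM"):
    """Build the PSI-Coffee command line as an argument list (helper name is hypothetical)."""
    return [
        "t_coffee", fasta_path,
        "-mode", "psicoffee",
        "-blast_server", "LOCAL",
        "-protein_db", database,
        "-template_file", template,
    ]


# Hypothetical local database paths; replace them with the locally formatted databases.
for database in ("db/uniref50", "db/uniref50_tm"):
    print(shlex.join(psicoffee_command("seqs.fasta", database)))
```

The argument list can be passed as is to `subprocess.run` once T-Coffee and a local BLAST installation are available.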

Figure 1

Typical colour output (tm_html). In this example, the protein Or9a of Drosophila melanogaster and its orthologues in other Drosophila species were aligned with the PSITM template. The colour code corresponds to the HMMTOP prediction: yellow, in loop; red, TM helix; blue, out loop. Notably, the predicted topology of the Or9a set is consistent with the conclusion of Benton et al. [20].

Results

We first asked whether applying the PSI-Coffee homology extension algorithm to our reference dataset of TMPs could lead to some improvement over existing alignment methods. We did so using the NR database for homology extension. Results (Table 1) show that PSI-Coffee outperforms the other methods, most notably when considering entire columns (TC comparison). When doing so, we find an improvement of nearly 10% over PRALINE™. Owing to the small dataset size (eight families), the observed differences are not highly statistically significant, although the differences between PSI-Coffee and the other methods are consistently more marked than the differences among the other methods (Table 2). This increased accuracy comes, however, at a significant computational cost. One may therefore argue that the overhead of turning single sequences into profiles is so significant that it is not worth using this approach for large-scale analysis. In order to address this problem we asked whether one could achieve a similar level of accuracy while doing the homology extension on smaller databases.

Table 1 Comparison between the PSI-Coffee and other multiple sequence alignment methods on each BAliBASE2-ref7 family
Table 2 Statistical significance test of the performance between two methods

When using PSI-Coffee, profiles are built by performing a BLAST search for each sequence against NR. The database is therefore a key ingredient of homology extension, and it is interesting to ask how this parameter affects the overall accuracy of the procedure. We did so by providing PSI-Coffee with databases of various redundancy levels (UniRefXX), all built upon UniProt, and then realigning the reference datasets. Results (Tables 3 and 4, detailed performance per family in Additional file 2) show that the difference in accuracy is very small when compared with NR. Overall, the accuracy level of PSI-Coffee remains high regardless of the redundancy level. In practice, however, using UniRef50 means using a database trimmed to 50% redundancy that is approximately 3.5 times smaller than the full database. As one would expect, the CPU requirements of the extension process decrease accordingly, and the time required by the alignment goes down to 26,442 seconds, as compared with the 72,199 seconds required when using the full database (2.7 times faster).

Table 3 Performance comparison of different database sizes for the BAliBASE2-ref7
Table 4 Statistical significance test of the performance in different databases

Even so, the CPU requirements can be considered excessive when compared with the time needed by the default T-Coffee (which does not need to do any extension). We therefore decided to take advantage of the observation that, when doing homology extension with TMPs, one spends a lot of CPU time searching databases mostly made of unlikely TMP homologues. Indeed, 80% of the proteins in UniRef are non-TMPs. We therefore asked whether a simple database, built by filtering UniRef on keywords, could be used instead of the full database. This database, named UniRefXX-TM, is significantly more compact than its source. UniRef50-TM contains about 100 times fewer sequences than the full UniProt. Results obtained on this database show that using such a reduced protein set for the extension does not result in any trade-off between accuracy and efficiency. The level of accuracy is comparable and even superior to that achieved with the default PSI-Coffee, while the CPU time requirements are dramatically decreased, by a factor of 10. We named TM-Coffee the flavour of PSI-Coffee running homology extension against UniRef50-TM. Table 5 shows that TM-Coffee remains slower than methods that do not rely on homology extension but is dramatically faster than PROMALS, a well-known aligner using homology extension.

Table 5 Comparison of running time

We finally asked whether one can alter the effect of homology extension by filtering the BLAST output based on e-value and removing distantly related sequences (i.e. remote homologues) that are likely to be inaccurately or only partially aligned. Results are shown in Figure 2. As one can see, no benefit is gained from filtering the BLAST output, and the overall accuracy tends to increase when more hits are included. The slight drop in the curve is not statistically significant and can be attributed to a single misaligned sequence. Overall this result indicates that using the default BLAST parameters (NONE in Figure 2) leads to profiles where low-quality hits have only a negligible impact on the overall alignment accuracy, probably because remote homologues tend to be filtered out through their low coverage.

Figure 2

Line chart of the average TC with respect to different e-value thresholds on the UniRef50-TM database. The number of homologues is counted by summing all homologues found in the eight families and is plotted on a log10 scale. The dashed lines show the standard error of the TC score across the eight families. SP is omitted because it changes only marginally across e-value thresholds.

Discussion and conclusions

In this work we show that homology extension can be used to significantly increase the accuracy of transmembrane protein multiple sequence alignments. When considering entire columns (the most stringent measure of multiple alignment accuracy), our results suggest that PSI-Coffee is about 10% more accurate than MSAProbs, the next best method. This improvement comes, however, at a cost and we show that the default PSI-Coffee requires about 30 times more CPU time than simpler methods. We therefore explored the possibility of using more compact non-redundant databases and found that when using a database trimmed to 50% redundancy and containing only sequences annotated as TMPs, we could achieve the same level of accuracy as PSI-Coffee while only requiring a tenth of the CPU time. This new protocol is named TM-Coffee.

Regardless of the improvement reported here in terms of CPU, TM-Coffee remains a relatively slow method. One may ask whether the increased computation cost is worth the improvement reported here. There is no simple answer to this question. For instance, if we consider Kalign and TM-Coffee, the difference in CPU requirement is about a thousand-fold. The difference in accuracy at the column level, however, is about 28%. These are major differences, bound to dramatically affect any modelling based on an MSA. Of course, one may argue that the column score can be affected by a single misaligned sequence and therefore amplifies the real difference. This is probably true for some applications of MSAs, yet many circumstances exist, such as homology or phylogenetic modelling, where the misalignment of a single sequence can have a major impact on the conclusions drawn from the analysis of a dataset.