Background

Helical transmembrane proteins participate in many cellular mechanisms, which makes them an important class of drug targets. The biological function and mechanism of action of these proteins are determined by their three-dimensional (3D) structure, and membrane helices are signature elements of these structures [1]. Thus, predicting the location of membrane helices from protein sequence can provide powerful constraints for inferring 3D structure and in turn assist in elucidating molecular mechanisms. This endeavour is particularly important for membrane proteins, for which relatively few unique structures have been experimentally determined.

The sequencing of entire genomes has brought an explosion in the number of available protein sequences. It is estimated that 20-30% of the genes in sequenced genomes code for helical membrane proteins [2], a figure in stark contrast to the ~1% of experimentally determined 3D structures that are of helical membrane proteins. This discrepancy spurs the development of methods for predicting membrane helices from sequence, as such predictions permit better identification of pharmaceutically significant membrane proteins.

The topography of membrane helical segments - the description of their location in the amino acid sequence - is the focus of many current prediction methods. The most recent methods also predict whether the N-terminus and the non-membrane loop segments between membrane helices lie on the inner or outer side of the membrane - the topology of the membrane helices, which relates to their orientation in the membrane. Prediction methods use a range of algorithmic strategies, including: amino acid hydrophobicity and other biophysical characteristics; evolutionary information in the form of multiple sequence alignments, and; machine learning strategies such as hidden Markov models, neural networks and support vector machines, which are trained on sequences of known membrane helices.

No single membrane helix prediction method scores well on all scoring criteria [2]. Many methods predict the most commonly observed types of membrane helices, but many do not predict the less frequently observed helices, such as the half-membrane helices of the ion channels, because the methods have been optimised to predict only transmembrane helices that completely cross the membrane [1, 2]. It is now recognised that half-membrane helices, which only partially cross the membrane, are important because they constitute signature structural elements of membrane helical protein families such as the potassium channels, aquaporins, chloride channels, the glutamate transporter homologue, and the protein conducting channel [3]. Half-membrane helices have previously been inventoried and classified as re-entrant loops consisting of either a helix-turn-coil, coil-turn-helix or helix-turn-helix [4]. Recent x-ray crystallographic structures are revealing further diversity of half-membrane helices, such as: discontinuous non-re-entrant half-membrane helices joined by extended 5-7 residue loops in respiratory complex I [5]; a half-membrane helix connected at 70° by a 10-residue hinge to a membrane interface helix in the maltose transporter [6, 7]; non-re-entrant half-membrane helices in the formate transporter [8] that are structurally homologous to the re-entrant aquaporin helix, and; a re-entrant, partially 3₁₀ helix in a photosystem I structure that lies parallel to the membrane plane in place of a hairpin turn [9].

As membrane helix prediction methods are developed and improved, there is a continuing need to evaluate and compare their performance, both to aid method development and to directly evaluate method applicability. A benchmark tool for calculating and comparing the accuracy of membrane helix topography and topology predictions from sequence would fill this need. Independent evaluations of existing membrane helix prediction methods have been conducted [10, 11] but do not include important recent methods. The publications describing the most recent prediction methods do report benchmark comparisons, but only against a limited set of available prediction methods [3, 4, 12-16]. No benchmark has comprehensively evaluated the predictive power for specialised classes of transmembrane proteins using high resolution data of known protein topologies as the benchmark standard. A study by Tsirigos et al. in 2012 [17] reports a comparison of 18 prediction methods, most of them recent, but does not provide a facility for users to run their own evaluations. Finally, since the gold standard for evaluating membrane helix prediction accuracy is comparison of the predictions to known membrane helix positions in high resolution solved 3D structures, it is important that evaluations incorporate recent experimental structures. Many novel membrane protein structures have been solved recently and were not available to past evaluations.

Construction and content

The benchmark server presents the user with options for controlling its inputs and outputs. The inputs are: the prediction methods to be compared; the sequences on which the selected prediction methods operate, and; the reference helix assignments against which the helix predictions are compared. The outputs the server can generate are the results of the benchmark defined by these inputs, or more detailed information about the inputs. Details of all these parameters are given in the following subsections.

Prediction methods

The server makes available a total of 52 sequence-based prediction methods: 24 topographical methods, 27 topological methods, and 1 method that seeks to predict membrane-dipping re-entrant loops (TMLOOP [3]). The topographical methods seek to assign membrane-helix character to segments of input amino acid sequences, whereas topological methods seek to assign to predicted membrane helices an orientation with respect to the membrane interfaces (inner or outer).

The available methods are those that were freely accessible and could be run in batch mode. During implementation of this server, all of the prediction methods were applied to all the sequences available on the server (detailed in the next section), and the results were cached to speed up the user experience. When this caching was performed, the prediction methods were all run with default parameters. Methods, method types and parameters are listed in Additional file 1: Table S1.

Sequence data

The protein data used by the server to benchmark predictions were sourced from the wwPDB [18] in February 2012 and consist of 1045 unique amino acid sequences. These 1045 sequences break down to: 481 sequences from polytopic or bitopic helical membrane proteins; 95 sequences from β-barrel membrane proteins, and; 469 sequences from soluble proteins.

Users can configure a subset of these sequences to be used in the benchmark of prediction methods according to a number of sequence attributes. These attributes are: similarity level; phylogenetic kingdom; transmembrane helix profile (bitopic or polytopic only, half-membrane helices or not); membrane protein structure family (as assigned by the Membrane Proteins of Known 3D Structure database [1]); experimental resolution; experimental method, and; year of submission.

The server defaults to including only sequences from helical membrane proteins with a similarity threshold of 30%, experimental resolution of 3.5 Ångström or better, and experimental methods of X-ray diffraction or solution NMR - a set of 392 sequences. Users may choose to include the sequences of β-barrel membrane proteins or soluble proteins in order to assess the false positive rate of prediction methods, which is useful information when evaluating a prediction method's performance for genome-level scanning. Similarly, the options for selecting sequence data by kingdom will assist users in evaluating genome scanning potential.
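
As an illustration of how such a default selection might be expressed programmatically, the sketch below applies the default criteria to a list of sequence records. This is not the server's own Perl code; the record field names and the handling of entries without a reported resolution are assumptions made for the example.

```python
# Illustrative sketch of the server's default sequence selection criteria.
# Field names ('protein_class', 'method', 'resolution') are assumed for this
# example and do not reflect the server's actual data model.

DEFAULT_METHODS = {"x-ray diffraction", "solution nmr"}

def default_selection(records, max_resolution=3.5):
    """Keep helical membrane protein sequences solved to 3.5 Angstrom or better
    by X-ray diffraction or solution NMR."""
    selected = []
    for rec in records:
        if rec["protein_class"] != "helical membrane":
            continue
        if rec["method"].lower() not in DEFAULT_METHODS:
            continue
        # Assumption: entries without a reported resolution (e.g. NMR) are kept.
        resolution = rec.get("resolution")
        if resolution is not None and resolution > max_resolution:
            continue
        selected.append(rec)
    return selected
```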

The set of sequences from helical transmembrane proteins is diverse in the types of helices it contains. Of the 392 default transmembrane protein sequences, 65 contain one or more half-membrane helices or re-entrant loops in addition to transmembrane segments, and the remaining 327 contain only transmembrane helices that completely cross the membrane. The soluble protein dataset was derived from the PDBselect25 list [19] of March 2012 and was reduced from 25% to less than 1% similarity using psi-cd-hit [20].

The sets of sequences at varying levels of similarity were pre-computed using “algorithm 2” from [19], with similarity evaluated using the EMBOSS [21] global alignment [22] and the EMBOSS local alignment [23]. The sets generated offer similarity levels from 20% to 100% in steps of 5%. For each kingdom, the 10% of soluble sequences least similar to the helical membrane dataset sequences were retained for the soluble protein dataset, with similarity determined by the identities metric for matches having an E-value below 0.005 [24]. The structure-function family classification of sequences from helical membrane proteins follows the Membrane Proteins of Known 3D Structure database [1]. The year of submission option allows users to include only sequences submitted after a certain date, which can aid in the detection of training bias in prediction methods that may have been trained on the same data.
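
The "algorithm 2" selection of [19] is, broadly, a greedy procedure: compute all pairwise similarities, then repeatedly discard the sequence with the most neighbours above the similarity threshold until no pair exceeds it. A minimal sketch of this strategy is given below; it is illustrative only, not the server's Perl implementation, and the pairwise similarity function is left abstract, standing in for the EMBOSS global or local alignment scores used by the server.

```python
def reduce_redundancy(seq_ids, pairwise_similarity, threshold=30.0):
    """Greedy, Hobohm-and-Sander-style 'algorithm 2' redundancy reduction (sketch).

    seq_ids: list of sequence identifiers.
    pairwise_similarity(a, b): percent similarity between two sequences,
        e.g. from an EMBOSS global or local alignment.
    Returns a subset in which no pair exceeds the similarity threshold.
    """
    # Build the neighbour graph of pairs above the threshold.
    neighbours = {s: set() for s in seq_ids}
    for i, a in enumerate(seq_ids):
        for b in seq_ids[i + 1:]:
            if pairwise_similarity(a, b) > threshold:
                neighbours[a].add(b)
                neighbours[b].add(a)

    # Repeatedly drop the sequence with the most remaining neighbours.
    kept = set(seq_ids)
    while kept:
        worst = max(kept, key=lambda s: len(neighbours[s] & kept))
        if not neighbours[worst] & kept:
            break  # no pair above the threshold remains
        kept.discard(worst)
    return kept
```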

Users can also upload their own predictions for the selected sequences, allowing novel prediction methods to be compared against the many existing methods. The sequences for a selection can be retrieved by choosing the appropriate output option, as described in the following sections.

Reference helix assignments

The performance of topography and topology predictions is measured against membrane helix assignments in high resolution three-dimensional structures from the Protein Data Bank (wwPDB) [18]. The user can choose from four sets of reference helix assignments: OPM membrane helices; OPM adjusted membrane helices; PDBTM membrane helices, and; PDBTM membrane helices and loops. The server defaults to OPM adjusted membrane helices.

The helices in a solved 3D structure can deviate from the definition of a canonical helix, and the location of the membrane with respect to the helix cannot be definitively determined from crystal and NMR structures. The membrane regions of proteins available in the server can be taken from the Orientations of Proteins in Membranes (OPM) database [25] or the Protein Data Bank of Transmembrane Proteins (PDBTM) [26, 27]. Manual visual comparison of the membrane helices common to structure-function families, as assigned by the Membrane Proteins of Known 3D Structure database, permitted the identification of short membrane helices that had not been identified as OPM membrane segments in some members of a family. The reference helix assignment dataset that includes these is referred to in the server as “OPM adjusted membrane helices” and hereafter is abbreviated to “OPM-adjusted”. PDBTM classifies short membrane helices as part of a loop region that also includes the coil portion of the re-entrant helix, and the server optionally allows these to be counted as membrane helices. This loop-inclusive reference helix assignment dataset is referred to as “PDBTM membrane helices and loops”. The OPM- and PDBTM-assigned membrane helices differ on average by 2 residues per helix boundary.

For topology assignments, the benchmark server uses the assignments reported in OPM. PDBTM assigns the two sides of the membrane without specifying which is inside or outside, so these assignments were compared to the OPM assignments to arrive at inside/outside topology assignments for benchmarks using PDBTM topography assignments. The term 'outside' refers to the extracellular face of the membrane, and 'inside' to the other side. Three-dimensional structure determination does not inherently establish how the protein is oriented in the membrane, making this processing necessary.
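
The sketch below illustrates one way this reconciliation could be done. The per-residue label conventions ('i'/'o' for OPM loops, '1'/'2' for PDBTM's two anonymous sides) and the majority-vote rule are assumptions made for the example; the server's actual procedure may differ.

```python
from collections import Counter

def map_pdbtm_sides(opm_labels, pdbtm_labels):
    """Assign inside/outside meaning to PDBTM's two anonymous membrane sides.

    opm_labels:   per-residue labels with 'i' (inside) / 'o' (outside) for loops.
    pdbtm_labels: per-residue labels with '1' / '2' for the two membrane sides.
    Returns a mapping such as {'1': 'i', '2': 'o'}, chosen by majority agreement
    with the OPM assignment over all loop residues.
    """
    votes = {"1": Counter(), "2": Counter()}
    for opm, pdbtm in zip(opm_labels, pdbtm_labels):
        if pdbtm in votes and opm in ("i", "o"):
            votes[pdbtm][opm] += 1

    # Pick the majority OPM label for side '1'; side '2' takes the opposite.
    side1 = votes["1"].most_common(1)[0][0] if votes["1"] else "i"
    side2 = "o" if side1 == "i" else "i"
    return {"1": side1, "2": side2}
```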

Outputs

Once the parameters of a benchmark have been chosen, the user can select from a number of operations to perform. The default operation is to execute the benchmark and receive the benchmark results. Other operations retrieve further information about the parameters the user has chosen for the benchmark, such as the aligned predictions for the selected prediction methods and sequences, or the selected sequences in a variety of formats. These options allow the user to perform their own predictions on the chosen sets of sequence data, which can be uploaded to the server and included in benchmarking.

When a benchmark is performed, the results consist of a number of scores with differing granularity, each of which describes a different feature of prediction. The scores are divided into topography scores and topology scores.

For topography scores, the levels of granularity are:

  1. per protein sequence accuracy, which measures the percentage of protein sequences for which all membrane helices are predicted correctly;

  2. per segment accuracy, which measures the performance of predicting individual helices and has two components:

     2a. sensitivity, the percentage of reference helices that are correctly predicted;

     2b. specificity, the percentage of predicted helices that are actually in the reference helix dataset;

  3. helix boundary accuracy, which measures the ability of methods to correctly predict the residues where a helix begins and ends, and has three variations;

  4. per residue accuracy, which measures the ability of methods to correctly assign specific helix characters to individual residues, and has a number of variations.

For topology scores, the levels of granularity are:

  1. per protein sequence accuracy, which has two components:

     1a. localisation, which measures the ability of methods to correctly assign the localisation within the membrane environment of all segments of the protein chain;

     1b. orientation, which measures the ability of methods to correctly assign the N-terminal end of a protein chain to the correct locale in the membrane environment;

  2. per segment accuracy, which measures the ability of methods to correctly assign the orientation and localisation within the membrane environment of individual segments of the protein chain;

  3. per residue accuracy, which measures the ability of methods to correctly assign the orientation and localisation within the membrane environment of individual residues.

The topography per-protein-sequence, per-segment and per-residue measures provided by the now unavailable transmembrane helix benchmark server of [28] are included in this new benchmark server and extended with topology and helix boundary prediction accuracy measures and Matthews Correlation Coefficients (MCC) [29].
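
For reference, the Matthews Correlation Coefficient in its standard form is computed from per-residue counts of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) as

\[
\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}}
\]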

Some of these metrics have been reported in previous benchmarks [13-16]. Although per-residue scores are provided, it is the per-segment scores, also known as segment overlap (Sov) scores [30, 31], that can be considered the more informative metrics, because it is the secondary structure type (α-helical, β-barrel, or coil), position, and number of secondary membrane structure segments that characterise structure and function [1, 31]. The differences between OPM- and PDBTM-assigned helix boundaries show that residue-level helix assignments are not unambiguously agreed upon. As an example of how a per-residue score can be misleading, predicting a highly α-helical protein to be entirely helical gives a high per-residue score, inflating the perceived performance of the prediction method [32]. The metrics provided by this benchmark server and their formulae are listed in Additional file 1: Table S2.
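
To make the segment-level idea concrete, the sketch below counts a predicted helix as matching a reference helix when the two overlap by a minimal number of residues, and derives segment sensitivity and specificity from those matches. This is a simplified illustration rather than the full Sov formulation or the server's implementation; the overlap threshold and helper names are assumptions.

```python
def overlaps(seg_a, seg_b, min_overlap=3):
    """Return True if two (start, end) residue ranges share at least min_overlap residues."""
    start = max(seg_a[0], seg_b[0])
    end = min(seg_a[1], seg_b[1])
    return (end - start + 1) >= min_overlap

def per_segment_scores(reference, predicted, min_overlap=3):
    """Segment-level sensitivity and specificity for one sequence.

    reference, predicted: lists of (start, end) residue ranges, 1-based inclusive.
    Sensitivity: fraction of reference helices hit by at least one prediction.
    Specificity: fraction of predicted helices hitting at least one reference helix.
    """
    hit_ref = sum(1 for r in reference if any(overlaps(r, p, min_overlap) for p in predicted))
    hit_pred = sum(1 for p in predicted if any(overlaps(p, r, min_overlap) for r in reference))
    sensitivity = hit_ref / len(reference) if reference else None
    specificity = hit_pred / len(predicted) if predicted else None
    return sensitivity, specificity

# Example: three observed helices; one is missed, one has shifted boundaries.
ref = [(12, 33), (45, 66), (80, 101)]
pred = [(10, 30), (47, 70)]
print(per_segment_scores(ref, pred))  # (0.666..., 1.0)
```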

A visual comparison section displays observed versus predicted membrane helix positions and inside/outside topology for protein amino acid sequences at amino acid resolution, revealing problematic helices and topology segments in detail.

Utility and discussion

The benchmark server was used to benchmark all the transmembrane helix prediction methods available on the server, making use of the wide variety of available parameters. Five major benchmarks were performed, testing different measures of accuracy: sensitivity; specificity; correctly predicted sequences; topology, and; helix boundaries. The results of these benchmarks are presented in Table 1. To illustrate the flexibility of the server for carrying out specialised evaluations, specialised benchmarks of predictions of membrane channel helices were carried out and are reported in Table 2.

Table 1 Benchmark results showing prediction methods and their scores for benchmark measures
Table 2 Sensitivity benchmark results for predictions of families of membrane channels

These benchmarks were all performed using the standard interface to the server and can be performed by any user.

Benchmarking considerations

Using the server's data selection features, benchmark data subsets were chosen specifically to illustrate the extent of the differences in prediction accuracy for the metrics involved in each benchmark question. These data subsets are specified in Table 3. For sensitivity benchmarks, only membrane sequences containing at least one membrane helix were included, so as not to dilute the differences in sensitivity with statistics for sequences containing no membrane helices. When performing specificity benchmarks, however, it is highly relevant to include sequences that do not contain membrane helices so that false positive rates may be assessed, and so specificity benchmarks included such sequences. Specificity benchmarks were also carried out on datasets in which the only membrane sequences included were those containing at least one membrane helix.

Table 3 Characteristics of the subsets of data used for the benchmarks reported in this paper

For topology assessments, benchmarks were performed on datasets both containing and excluding half-membrane helices, because difficulties in predicting half-membrane helices can adversely affect topology performance. For helix boundary assessments, benchmarks were carried out separately using the OPM-defined and the PDBTM-defined membrane helices, because these two definitions do not assign helix boundaries identically. Apart from this assessment, all benchmarks were carried out using only the OPM-adjusted membrane helix assignments. All benchmarks used a data subset in which member sequences were restricted to less than 30% similarity to other sequences in the subset, with similarity measured by EMBOSS global sequence alignment. Unless otherwise specified, the default benchmark server parameters were used.

For the specialised benchmarks of membrane channel predictions, the benchmark data were restricted to sequences belonging to specific membrane protein structure families, as specified in Table 4. As few benchmark sequences are available in the server for each family of membrane channels, the benchmark dataset sequences were not restricted by similarity to each other. For all other parameters the benchmark server defaults were used.

Table 4 Counts of sequences and membrane helices of the data subsets used for the specialised benchmarks for channels reported in this paper

The prediction methods were all run with their default parameter settings, so the benchmark results are presented with the caveat that the methods may perform better when their parameters are optimised for the specific prediction question being judged by the benchmark.

Sensitivity

The highest scoring methods for sensitivity use a range of different algorithms and information, including machine learning, biophysical properties, sequence alignments and consensus, with no single strategy showing clear superiority at predicting membrane helices sensitively. These benchmarks also show that a consensus method (TOPCONS) [33] scores lower than the highest scoring method used to compile its consensus (PRODIV-TMHMM) [15, 33].

To investigate how well prediction methods perform on data that were not used to calibrate them, the benchmarks were repeated with the benchmark data restricted to wwPDB structures released in 2008 or later that have no similar sequences in the wwPDB before then. The resulting scores were on average 4% lower, demonstrating that prediction methods generally do not perform quite as well for sensitivity on data dissimilar to that used in their creation. This result, together with the observation that older machine learning methods do not perform as well as newer machine learning methods in the overall sensitivity benchmarks, suggests that the sensitivity of machine learning methods might benefit from retraining on the latest available data. Machine learning methods are highly represented among the top sensitivity scores, with 5 of the 7 highest scoring methods reported in Table 1 being machine learning methods.

Purely biophysical methods do appear among the top scoring sensitivity methods: VALPRED2 [34] is in the top 5 and SCAMPI-multi [33, 35] is in the top 10. This may indicate that future biophysics-based methods, built on knowledge of the forces that drive membrane helix formation, have the potential to give superior sensitivity. However, the simpler biophysical methods that use scores based only on hydrophobicity perform significantly worse than the other methods.

Specificity

The highly sensitive prediction methods, with the exception of VALPRED2, have high specificity scores of 95% and above in the benchmark on membrane helical sequences, showing the welcome result that sensitivity has not come at the expense of generating many false positives. The list of prediction methods obtaining the highest specificity scores changes completely when the β-barrel and soluble sequences are included with the membrane helical sequences in the benchmark. This indicates that the choice of the most specific prediction method should depend on whether the predictions are made on proteins already known to be helical membrane proteins - as in detailed investigations of specific membrane proteins - or on sets that also contain soluble proteins - as in genome annotation.

Correctly predicted sequences

The highest performing method in benchmarks for correctly predicting all, and only, the observed membrane helices in helical membrane protein sequences is OCTOPUS [16, 36]. As was the case with specificity, the list of best methods changes completely when β-barrel membrane and soluble protein sequences are included in the benchmark, with OCTOPUS dropping to 28th place. These benchmark results suggest that a two-pronged approach is appropriate for automated genome annotation, which requires both sensitivity and specificity: first use a prediction method with high specificity for discriminating between membrane helical and non-membrane-helical sequences to identify sequences containing membrane helices, and then, for those sequences, use OCTOPUS predictions for the actual membrane helix annotation, thus avoiding the false positive predictions of OCTOPUS for non-membrane-helical proteins.
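
A minimal sketch of this two-pronged strategy is given below. The callables 'high_specificity_filter' and 'octopus_predict' are hypothetical placeholders for whichever high-specificity discriminator and OCTOPUS wrapper are actually available; they are not interfaces provided by the methods or by the benchmark server.

```python
def annotate_genome(sequences, high_specificity_filter, octopus_predict):
    """Annotate membrane helices only for sequences the discriminator accepts.

    sequences: dict mapping sequence identifiers to amino acid strings.
    high_specificity_filter(seq) -> bool: True if the sequence is predicted
        to contain at least one membrane helix.
    octopus_predict(seq) -> list of (start, end) predicted helix ranges.
    """
    annotations = {}
    for seq_id, seq in sequences.items():
        if high_specificity_filter(seq):                # stage 1: discriminate
            annotations[seq_id] = octopus_predict(seq)  # stage 2: annotate helices
        else:
            annotations[seq_id] = []                    # treated as non-membrane-helical
    return annotations
```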

Topology

The OCTOPUS and MEMSAT-SVM [14] prediction methods predict re-entrant segments and adjust the topology prediction accordingly, by predicting both sides of a re-entrant segment to be on the same side of the membrane instead of alternating inside/outside. However, MEMSAT3 [37, 38] scores better, and other methods score almost as well as OCTOPUS and better than MEMSAT-SVM, at correctly assigning inside/outside topology, even though they do not predict re-entrant loops. This is because those methods do not consider re-entrant loops at all, which removes the possibility of putting the alternating inside/outside topology prediction out of register.

Membrane helix boundaries

OCTOPUS is consistently best in benchmarks for predicting helix boundaries to the residue, regardless of whether the OPM or the PDBTM assignments are used as the reference, even though OPM and PDBTM do not always assign the same helix boundaries.

Specialised benchmarks for channels

Benchmarks of membrane helix predictions for channel families were performed. The results, shown in Table 2, indicate which prediction methods are best at predicting membrane helices in the different channel families, and show that not all methods predict these signature membrane helices.

Conclusions

This benchmark server for assessing predictions of membrane helices from sequence incorporates recent high resolution 3D structure data and thus provides the most accurate benchmark currently possible. Prediction of membrane helices from sequence continues to be a valuable activity, and it is appropriate that recently available 3D structure data be used to benchmark such predictions. We have reported the results of various benchmarks carried out with this server.

The benchmark server provides sub-categorisations, combinations and customisation of the benchmark data, allowing benchmarks to be tailored to specific purposes. This allows users to assess which prediction algorithms are best for varied applications. The data-selection capabilities, coupled with the ability to upload and benchmark the results of novel prediction methods, permit the tuning and assessment of novel prediction algorithms. Use of this server has provided insights into currently available prediction methods. For example, benchmark results from this server suggest a two-pronged approach to membrane helix genome annotation, given the finding that the methods performing best on datasets containing only helical membrane proteins often perform much worse on mixed datasets that also contain non-membrane-helical sequences, as would be encountered with genomes. The server results also suggest the possibility of training bias in the machine learning methods surveyed.

We present this benchmark server as a tool for comparing current membrane helix prediction methods, and for comparing novel methods against current methods so that one might meaningfully evaluate their performance and suitability to a variety of bioinformatic tasks.

Availability and requirements

Project name: Benchmark of Membrane Helix Predictions from Sequence

Project home page: http://sydney.edu.au/pharmacy/sbio/software/TMH_benchmark.shtml

Operating systems: The user accesses the benchmark server through a standard Internet web browser. The server runs on a Linux platform.

Programming language: Perl

Other requirements: There are no requirements for the user other than a computer with an Internet web browser.

License: The Perl software of the benchmark server will be released under an open source software license, such as the Free Software Foundation's GNU General Public License or a Creative Commons license (creativecommons.org).

Restrictions to use by non-academics: There are no restrictions on use of this benchmark server.