In the past, a protein's structure was usually experimentally determined after its biological role had been thoroughly elucidated, and the structure was used as a framework to explain known functional properties. This led to the view that the reliable prediction of the structure of a protein from its sequence would almost automatically provide information about its function. The good news is that methods for predicting structure from sequence can now produce good models for a substantial fraction of the protein space [1]. But the idea that knowledge of a protein's structure is sufficient for functional assignment has needed revision [2]. Many proteins of known structure are not yet functionally characterized and their number is increasing. The investigation of sequence-function and structure-function relationships has therefore become a fundamental necessity. Understanding these relationships will be crucial for moving from an inventory of protein parts to a more profound understanding of the molecular machinery of organisms at a systems level.

This review describes progress in developing both sequence- and structure-based methods for function prediction. There are many current methods, which use a variety of different approaches (Figure 1), and their integration is a major challenge. One approach to this problem has been employed by the BioSapiens Network [3] (see Box 1), of which we are members, through which several new methods have been developed and predictions integrated using the Distributed Annotation System (DAS) [4] (see Box 1). DAS allows different laboratories to 'combine' their annotations, produced by both experimental and computational approaches, to generate a 'composite' annotation at all levels. A 'protein sequence ontology' to facilitate comparison of the annotations has also been developed (G Reeves, personal communication).

Figure 1

Automated strategy for assigning function to proteins. The various approaches to protein function prediction are described in the text. Both protein sequences and structures can provide information for family classification and functional inference. Sequence-based methods make use of different strategies for grouping proteins into families (for example, sequence tree construction based on clustering of all against all sequence comparisons) or they compare the target sequence with pre-compiled databases of families. When a structure is available, the whole structure can be scanned against precompiled sets of functional sites. Alternatively, fragments of the target protein can be used to identify any structural similarities in the conformation of proteins of known structure, possibly related to a molecular function. Both sequences and structures, together with protein-protein interaction data, can be used to infer interactions, which can provide functional clues. Ideally, an independent set should be used to assess the reliability of the various methods.

Box 1

Glossary of terms

We will first focus on methods that attempt to extract functional information from protein sequences, which generally exploit the power of alignment and clustering, and then discuss strategies that use protein structure information. When an experimental three-dimensional (3D) structure is not available, such methods can, in principle, be applied to modeled structures, although the quality of the model will dictate which methods can be applied (for a review, see [5]). We will also briefly discuss tools that exploit interactions between proteins as a means of inferring their function and survey systematic assessments of the effectiveness of function-prediction methods.

Sequence-based classification of proteins

The first hurdle for any functional annotation process is to define 'function'. If the protein is an enzyme, then simply using the EC numbering scheme (see Box 1) can be useful. In general, however, the problem is multi-dimensional: a protein can have a molecular function, a cellular role, and be part of a functional complex or pathway (these are the distinctions used in the Gene Ontology (GO; see Box 1) [6]). Furthermore, certain aspects of molecular function can be described at multiple levels of detail (for example, the coarse 'enzyme' category versus a more specific 'protease' assignment). Even the more detailed definition would not reveal the cellular role of the protein (apoptosis, metabolism, blood coagulation, and so on).

Most function-prediction methods, both sequence and structure based, rely on inferring relationships between proteins that permit the transfer of functional annotations and binding specificities from one to the other. A notable challenge here is deciphering the connection between the detected similarities (structural or in sequence) and the actual level of functional relatedness. Function is often associated with domains, and another problem is the identification of functional domains from sequence alone. The accuracy of current methods for predicting domain boundaries is not yet completely satisfactory. Several methods provide reliable predictions if a structural template for the protein is available, but when this is not the case, one is left with the problem of whether the experimental annotation used for the inference refers to the same domain for which the sequence similarity/motif is established [7].

The function of a protein can also be inferred from its evolutionary relationship with proteins of known function, provided that the relationship is properly inspected. Orthologous proteins in different species most often share function, but paralogy (that is, divergence following duplication of the original gene) does not guarantee common function. Distinguishing between orthology and paralogy can be attempted on the basis of observed sequence-similarity patterns, by analyzing the specific conservation pattern of residues responsible for function in the family, or on the basis of the protein structure (either experimentally determined or modeled). In all cases, this requires the clustering of proteins into evolutionary families, which can be achieved using similarity-detection tools such as BLAST [8] or profiling tools based on multiple sequence alignments, for example, PSI-BLAST [9]. Several available resources provide pre-compiled family assignments for proteins on a genomic scale, based only on their sequence. Resources can be subdivided into those that consider full-length sequences and those based on domains or motifs that map to certain sub-sequences. In both cases, the degree of granularity of the classification is important, as this is related to the level of functional features that a group of proteins is expected to share.
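One widely used heuristic for proposing ortholog candidates from sequence-similarity patterns is the reciprocal best hit criterion. It is not one of the specific methods discussed above and is shown here only as a minimal sketch; the protein identifiers and scores are hypothetical and would in practice come from a similarity search such as BLAST.

```python
from typing import Dict, List, Tuple

# Hypothetical similarity scores (for example, bit scores) keyed by
# (query, hit). All identifiers and values below are invented.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Return the highest-scoring hit for every query in the table."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Pairs (a, b) in which a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Genome A proteins searched against genome B, and vice versa.
a_vs_b = {("a1", "b1"): 250.0, ("a1", "b2"): 120.0,
          ("a2", "b1"): 200.0, ("a2", "b2"): 90.0}
b_vs_a = {("b1", "a1"): 248.0, ("b1", "a2"): 199.0,
          ("b2", "a1"): 118.0, ("b2", "a2"): 91.0}

ortholog_candidates = reciprocal_best_hits(a_vs_b, b_vs_a)
print(ortholog_candidates)
```

In this toy example a2 also hits b1 strongly but is not its best reciprocal partner, so it would be treated as a possible paralog rather than an ortholog candidate. The criterion is a heuristic only and can fail after gene loss or lineage-specific duplication.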

A resource that classifies full-length proteins is PIRSF [10], in which a set of rules is applied to define primary and curated clusters that are also based on textual (protein names, literature) and parent-child relationships. These clusters (named superfamilies) are further divided into those with full-length similarity (that is, common domain architecture) and those sharing an ancestral domain. PIRSF covers more than two-thirds of the protein sequence space.

Studying proteins at a domain level allows more accurate functional inference [11] and is useful for predicting the function of novel domain combinations that possibly give rise to new protein functions [12]. In this type of resource, a family of domains is represented as a multiple sequence alignment, which is embodied in a statistical family signature profile (for example, CDD [13] and PROSITE [14]) or a profile-hidden Markov model (for example, Pfam [15] and SMART [16]), collectively referred to here as profiles. Pfam, a prototype for such collections, currently contains more than 9,000 family profiles and covers roughly 70-74% of UniProt sequences, capturing about half of their amino acids [17]. About 40-45% of Pfam families are associated with known structures, whereas 20-25% are currently uncharacterized. Other resources, for example CDD, use externally defined profiles to provide rapid assignments to sequence queries, using a BLAST-like engine to speed up searches.
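To make the notion of a family profile concrete, the following minimal sketch derives a position-specific log-odds matrix from a small ungapped alignment and scores sequences against it. The alignment, the uniform background and the pseudocount are illustrative assumptions; the models used by resources such as Pfam or PROSITE are considerably more elaborate.

```python
import math
from collections import Counter
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """One log2-odds table per alignment column, against a uniform background."""
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        # Gap characters and non-standard residues are ignored.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile: List[Dict[str, float]], sequence: str) -> float:
    """Sum of column scores; assumes an ungapped sequence of standard residues."""
    return sum(column[res] for column, res in zip(profile, sequence))


# Hypothetical four-column alignment of a toy family.
toy_alignment = ["HCGH", "HCAH", "HCGH", "HSGH"]
toy_profile = build_profile(toy_alignment)

member_score = score(toy_profile, "HCGH")
unrelated_score = score(toy_profile, "WWWW")
print(member_score > unrelated_score)
```

A sequence that matches the conserved columns receives a positive score, whereas an unrelated sequence scores below zero; a real search would additionally handle insertions and deletions and calibrate a significance threshold.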

Profile-based methods and resources differ significantly in their level of automation, their degree of manual curation, and the level of independence from complementary resources used in the classification. Combining these resources provides more comprehensive coverage, as reflected by InterPro [18], a repository of protein families integrating signatures from more than 10 member resources, currently covering nearly 75% of UniProt sequences. InterPro also includes Gene3D [19] and SUPERFAMILY [20], which provide sequence profiles corresponding to the structural classification of folds by CATH [21] and SCOP [22], respectively. A resource exploiting the multiplicity of essentially complete genome sequences is COG (Clusters of Orthologous Groups), an evolutionary classification that uses comparative genomics principles, such as phyletic profiles [23] (see Box 1), to identify the presence of orthologs and group them accordingly.
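A phyletic profile can be thought of as a presence/absence vector of a family across a set of genomes. The sketch below compares such vectors with a Jaccard index; the family names, genomes and profiles are hypothetical, and the choice of similarity measure is an assumption rather than the procedure used by COG.

```python
from typing import Dict, Tuple

Profile = Tuple[int, ...]  # 1 = family present in that genome, 0 = absent

# Hypothetical presence/absence of four families across six genomes.
phyletic_profiles: Dict[str, Profile] = {
    "famA": (1, 1, 0, 1, 0, 1),
    "famB": (1, 1, 0, 1, 0, 1),
    "famC": (0, 1, 1, 0, 1, 0),
    "famD": (1, 1, 0, 1, 0, 0),
}


def jaccard(p: Profile, q: Profile) -> float:
    """Shared presences divided by genomes in which either family occurs."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0


def most_similar(name: str, profiles: Dict[str, Profile]) -> str:
    """Family whose profile is closest to that of `name`."""
    others = (other for other in profiles if other != name)
    return max(others, key=lambda other: jaccard(profiles[name], profiles[other]))


print(most_similar("famA", phyletic_profiles))
```

Families with near-identical profiles, such as famA and famB here, are candidates for a shared functional context, although matching profiles are suggestive rather than conclusive.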

A notable shortcoming of the methods described above is that they require definition of a threshold similarity for separating families from each other. An alternative approach to defining clusters is the construction of a tree representation that can provide a hierarchical view. Resources in this category include ProtoNet [24], CluSTr [25] and SYSTERS [26]. They are based on sequence similarities detected by an all-against-all sequence comparison, so that any level of evolutionary granularity can be inspected, from closely related subfamilies to more distant relationships.
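The effect of inspecting a hierarchy at different levels of granularity can be illustrated with a minimal single-linkage sketch: proteins are joined whenever their pairwise similarity reaches a chosen threshold, and lowering the threshold merges subfamilies into broader groups. The similarity values are hypothetical, and single linkage is an assumption made for brevity; ProtoNet, CluSTr and SYSTERS each use their own clustering schemes.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def clusters_at(similarity: Dict[Tuple[str, str], float],
                items: List[str],
                threshold: float) -> Set[FrozenSet[str]]:
    """Single-linkage clusters: connected components of the graph whose
    edges are the pairs with similarity >= threshold."""
    parent = {item: item for item in items}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), value in similarity.items():
        if value >= threshold:
            parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for item in items:
        groups.setdefault(find(item), set()).add(item)
    return {frozenset(group) for group in groups.values()}


# Hypothetical all-against-all similarities (scaled to 0..1).
proteins = ["p1", "p2", "p3", "p4"]
similarity = {
    ("p1", "p2"): 0.90, ("p3", "p4"): 0.80, ("p2", "p3"): 0.40,
    ("p2", "p4"): 0.35, ("p1", "p3"): 0.30, ("p1", "p4"): 0.20,
}

for threshold in (0.95, 0.70, 0.35):
    print(threshold, sorted(sorted(c) for c in clusters_at(similarity, proteins, threshold)))
```

At a stringent threshold every protein stands alone, at an intermediate one two subfamilies appear, and at a permissive one all four proteins fall into a single group.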

Approaches that do not rely solely on supervised annotation of family profiles include ProDom [27], which collects putative domain profiles using known sequence domains as query sequences for iterative PSI-BLAST searches [9]. EVEREST [28] is a fully automatic unsupervised method that identifies recurrent conserved regions on the basis of local sequence similarities and iterative profile searches.

The accuracy of sequence-based methods is affected by the type and amount of information on the specific protein family but, overall, they seem to be reasonably accurate. Their success rate has been shown to be greater than 70% when tested on a limited dataset (all structures solved by the Midwest Center for Structural Genomics during the first five years of the Protein Structure Initiative) [29].

Structure-based methods

As homologous proteins evolve, their 3D structure often remains more conserved than their sequence [30]. Consequently, similarities in protein structure can be more reliable than sequence similarities for grouping together distant homologs, which often retain some aspect of a common biological function [31]. The two most comprehensive structure-based family resources, CATH [21] and SCOP [22], classify domains into evolutionary families and into coarser structural classes. Although both resources use some automated protocols, domain assignments are primarily made by expert manual validation. CATH differs from SCOP mainly in its use of structure-comparison algorithms (for example, CATHEDRAL [32]) and of hidden Markov model-based approaches to provide guidance to curators during classification (see Box 1). Creating functional subfamilies within superfamilies in CATH or SCOP not only permits the analysis of functional divergence with respect to structure, but can also be used as a basis for structure-based function prediction. SCOP provides a level below superfamily that groups together closer homologs, often with more similar functions, and work is currently under way to offer similar information in CATH.

The first step in predicting function from structure is often to use global structure comparison: that is, to compare a query protein structure to domains in the structure databases. Although not directly coupled to a curated domain-family resource such as CATH and SCOP, other global structure-comparison methods (for example, DALI [33], MSDFold [34], VAST [35], CE [36], STRUCTAL [37] and FATCAT [38]) can identify structural neighbors in the Protein Data Bank (PDB) [39] (see Box 1), which may share functional similarities. Regardless of the algorithm used, care must be taken when transferring function from one protein to another, as two proteins may have a similar fold yet different functions (for example, the TIM-barrel scaffold).

Some algorithms exploit data on structural families to improve function prediction. The GASP method [40] applies a genetic algorithm to build templates made up of conserved residues in a given family of structures, which are evaluated on their ability to recognize other family members against a background of SCOP domains, when scanned using SPASM [41]. The DRESPAT [42] algorithm also identifies patterns within a family of proteins. The resulting structural motifs can be used to identify binding sites and to assign function to new structures.

Global structure-comparison methods are a useful first step for function assignment, but they do not discriminate between conservation of the overall fold and of functionally relevant regions of the protein. Other methods focus on more localized regions that might be relevant to function, such as clefts, pockets and surfaces. As the ligand-binding site or active site is commonly situated in the largest cleft in the protein [43], the identification and comparison of such regions can suggest putative functions. SURFNET [44] detects clefts by fitting spheres of a range of sizes between the protein's atoms. This approach has been enhanced by combining SURFNET with ConSurf [45] to identify only clefts that are close to evolutionarily conserved residues, as defined by the ConSurf-HSSP database [46]. Another surface-comparison method, pvSOAR [47], identifies similar surface patterns on the basis of geometrically defined pockets and voids. This approach and the associated CASTp [48] database have been used to create the Global Protein Surface Survey (GPSS). Functionally relevant surfaces (binding ligands, metals, DNA or peptide) are extracted through generation of an exclusion contact surface obtained by measuring the difference in solvent accessibility between a structure with and without a neighboring molecule.
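The idea of fitting spheres between atoms can be sketched as follows: for each pair of atoms a sphere is placed in the gap between them and shrunk until it no longer overlaps any other atom, and spheres that remain large enough mark empty space such as a cleft. This is a simplified illustration in the spirit of that approach, not a reimplementation of SURFNET; the atom coordinates, radii and size limits are assumptions.

```python
import math
from itertools import combinations
from typing import List, Tuple

Point = Tuple[float, float, float]
Atom = Tuple[Point, float]  # (centre, radius), units arbitrary but consistent


def gap_spheres(atoms: List[Atom],
                min_radius: float = 1.0,
                max_radius: float = 4.0) -> List[Tuple[Point, float]]:
    """Return (centre, radius) of spheres that fit in gaps between atom pairs."""
    spheres = []
    for i, j in combinations(range(len(atoms)), 2):
        (ci, ri), (cj, rj) = atoms[i], atoms[j]
        d = math.dist(ci, cj)
        radius = (d - ri - rj) / 2.0  # sphere touching both atom surfaces
        if not (min_radius <= radius <= max_radius):
            continue
        t = (ri + radius) / d  # fractional position of the centre along i -> j
        centre = tuple(a + t * (b - a) for a, b in zip(ci, cj))
        # Shrink the sphere so that no third atom penetrates it.
        for k, (ck, rk) in enumerate(atoms):
            if k not in (i, j):
                radius = min(radius, math.dist(centre, ck) - rk)
        if radius >= min_radius:
            spheres.append((centre, radius))
    return spheres


# Two atoms facing each other across an empty gap...
open_gap = [((0.0, 0.0, 0.0), 1.5), ((6.0, 0.0, 0.0), 1.5)]
# ...and the same pair with a third atom sitting in the gap.
filled_gap = open_gap + [((3.0, 2.0, 0.0), 1.5)]

print(gap_spheres(open_gap))
print(gap_spheres(filled_gap))
```

The open pair yields one sphere in the gap, whereas the intruding third atom shrinks that sphere below the minimum size and it is discarded. Clustering the surviving spheres, which is omitted here, would delineate individual clefts.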

Other pocket-centric approaches use the physicochemical properties of the local environments in the pockets and surfaces to describe protein-ligand interactions and active-site chemistry. For example, FEATURE [49] represents local microenvironments using various physical and chemical properties, described at levels ranging from atoms or chemical groups and single residues up to secondary structure. Similar approaches are those of SiteEngine [50] and the recently released SURF'S UP! service [51].

Other methods target specific active-site residues (such as catalytic clusters and ligand-binding sites). These approaches utilize a variety of template-based scans to identify active sites and putative ligand-binding sites, the rationale being that the 3D arrangement of enzyme active-site residues is often more conserved than the overall fold. Templates can be derived manually by mining the literature and assessing which residues form the active site (for example, the Catalytic Site Atlas [52]), or can be generated automatically, as in PDBSiteScan [53, 54], which uses the SITE records in PDB files and protein-protein interaction data to generate its templates. The Catalytic Site Atlas has been automatically expanded to include homologs identified by PSI-BLAST [9] and a new webserver (Catalytic Site Search) allows users to query the database directly [55].
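A template scan can be illustrated by searching a structure for residues of the required types whose mutual distances agree with those of the template. The sketch below is a deliberately naive, brute-force version using one representative point per residue; the template geometry, the structure and the tolerance are hypothetical, and it does not reproduce the matching procedure of any of the tools named above.

```python
import math
from itertools import permutations
from typing import List, Tuple

Point = Tuple[float, float, float]
Residue = Tuple[str, Point]  # (residue type, representative coordinate)


def match_template(template: List[Residue],
                   structure: List[Residue],
                   tolerance: float = 1.0) -> List[Tuple[int, ...]]:
    """Index tuples into `structure` whose residue types equal the template's
    and whose pairwise distances all lie within `tolerance` of the template's."""
    n = len(template)
    matches = []
    # Exhaustive enumeration: acceptable only for very small examples.
    for candidate in permutations(range(len(structure)), n):
        if any(structure[c][0] != t[0] for c, t in zip(candidate, template)):
            continue
        consistent = all(
            abs(math.dist(template[a][1], template[b][1])
                - math.dist(structure[candidate[a]][1], structure[candidate[b]][1]))
            <= tolerance
            for a in range(n) for b in range(a + 1, n)
        )
        if consistent:
            matches.append(candidate)
    return matches


# Hypothetical three-residue template (invented geometry).
triad_template = [("SER", (0.0, 0.0, 0.0)),
                  ("HIS", (3.0, 0.0, 0.0)),
                  ("ASP", (3.0, 4.0, 0.0))]

# Hypothetical query structure: the same arrangement, translated, plus decoys.
query_structure = [("GLY", (10.0, 10.0, 10.0)),
                   ("SER", (1.0, 1.0, 1.0)),
                   ("HIS", (4.0, 1.0, 1.0)),
                   ("ASP", (4.0, 5.0, 1.0)),
                   ("SER", (20.0, 0.0, 0.0))]

print(match_template(triad_template, query_structure))
```

Comparing distances alone cannot distinguish an arrangement from its mirror image, so a practical implementation would typically superpose the matched atoms and report a root mean square deviation instead.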

Conventional template-searching tools scan the structure of the uncharacterized protein against a database of templates. This idea has been turned on its head with the 'reverse template' approach (initially developed as part of the ProFunc server [56]), which fragments a query protein into many putative templates and scans each of them against the PDB to identify similarities. A stand-alone version, Tempura, has recently been released at the European Bioinformatics Institute [57]. A similar approach is used by the PINTS (Patterns In Non-homologous Tertiary Structures) server [58], which detects the largest common 3D arrangement of residues between any two structures, the assumption being that similar arrangements of residues might imply relatedness of function. The latest addition to automatic template generation uses the Evolutionary Trace (ET) approach [59]. ET uses phylogenetic trees to rank residues in a protein by their evolutionary importance and maps these onto the structure, the highest-ranking residues tending to cluster on the protein surface in functionally important sites. This approach has been developed to build an automated Evolutionary Trace Annotation (ETA) pipeline [60, 61] to identify functional sites, extract representative 3D templates and search for relevant geometric matches in other structures. Other template-centric approaches include Fuzzy Functional Forms (FFFs) [62] and SPASM/RIGOR [41].

Each method has its pros and cons and no single method is always successful. As a result, metaservers (see Box 1) have been developed that aim to combine many services to provide a consensus view that can often help researchers to identify the most likely functional predictions. The ProFunc server [56] is one such resource. It utilizes many of the previously described sequence-based and structure-based methods to present a summary of the most likely functions, represented by GO terms, in an intuitive and well linked web interface. A detailed benchmarking of ProFunc on structural genomics targets showed that, for the most successful methods, functional clues could be derived for approximately 60% of target proteins, of which about 70% were confidently predicted [29].

Another metaserver, ProKnow [63], also combines information from sequence and structural approaches, including fold similarity (DALI [33]) and templates (RIGOR [41]), with functional links taken from the DIP database of protein interactions [64]. The ProKnow authors quantified the level of the assigned function by the ontology depth (from 1 = general to 9 = specific) and showed that they can reach 89% correct assignments at ontology depth 1 and 40% at depth 9, with 93% coverage of 1,507 distinct folded proteins.

Finally, although technically not a structure-based approach, JAFA (Joined Assembly of Function Annotations) [65] is a metaserver that queries several function-prediction servers with a protein sequence to return a summary of predicted GO terms.

Protein interactions

Protein interactions provide a natural context for describing how these molecules catalyze metabolic reactions, build molecular machines and transmit cellular signals. The availability of high-throughput interaction data has enabled the 'guilt by association' principle to be applied to elucidating protein function. The exploitation of observed or predicted physical interactions to assign function is, however, complicated not only by the generally low quality of high-throughput data [66] and the sparseness of reliable interaction datasets derived from literature [67], but also by the sheer size of the problem. Recent estimates indicate that around 50,000 interactions may exist in yeast and more than 300,000 in human [68]. From the available 3D structures of protein complexes, the existence of around 10,000 distinct interaction types, defined by the particular mutual arrangement of their constituent subunits, has been proposed [69].
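In its simplest form, the 'guilt by association' principle assigns to an uncharacterized protein the annotations that are most common among its interaction partners. The sketch below implements this as a plain vote count; the interaction list, protein names and annotation labels are hypothetical, and no weighting for data quality is attempted.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def predict_function(protein: str,
                     interactions: List[Tuple[str, str]],
                     annotations: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Rank annotation terms by how many interaction partners carry them."""
    votes: Counter = Counter()
    for a, b in interactions:
        if protein in (a, b):
            partner = b if a == protein else a
            votes.update(annotations.get(partner, ()))
    return votes.most_common()


# Hypothetical interaction data; "X" is the uncharacterized protein.
toy_interactions = [("X", "P1"), ("X", "P2"), ("P3", "X"), ("P1", "P2")]
toy_annotations = {
    "P1": {"kinase activity"},
    "P2": {"kinase activity", "ATP binding"},
    "P3": {"DNA binding"},
}

ranked_terms = predict_function("X", toy_interactions, toy_annotations)
print(ranked_terms[0])
```

Two of the three partners of X share the same term, which therefore ranks first. With noisy high-throughput data such a vote is easily misled, which is one reason why confidence weighting of interactions matters in practice.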

The vast variety of molecular interactions can, however, be reduced to a limited number of recurrent domain-interaction types. The domain composition of a protein can thus give functional clues. The iPfam [70] and 3did [71] databases provide pre-computed structural information about interactions for Pfam domains. When no 3D structure is available, domain interactions can be inferred: by identifying domain pairs that are significantly overrepresented in interacting proteins [72]; by coevolutionary analysis; by identifying correlated mutations that preserve favorable physico-chemical properties of putative interaction interfaces [73]; or by detecting correlated phylogenetic distributions due to coevolution [74]. A correlated whole-genome phylogenetic distribution of different domains could also indicate that they interact with each other directly or at least share a functional role. Two new web resources - DIMA [75] and DOMINE [76] - integrate predicted and known domain interactions into comprehensive domain-interaction networks.
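The first of these strategies, finding domain pairs that are overrepresented among interacting proteins, can be sketched as a log-odds comparison between the frequency of a domain pair in observed interactions and its frequency over all possible protein pairs. The protein names, domain labels and interactions below are hypothetical, and the sketch omits the significance testing that a real analysis would require.

```python
import math
from collections import Counter
from itertools import combinations, product
from typing import Dict, List, Set, Tuple

DomainPair = Tuple[str, str]


def domain_pair_counts(pairs: List[Tuple[str, str]],
                       domains: Dict[str, Set[str]]) -> Counter:
    """Number of protein pairs in which each (unordered) domain pair occurs."""
    counts: Counter = Counter()
    for a, b in pairs:
        seen = {tuple(sorted(p)) for p in product(domains[a], domains[b])}
        counts.update(seen)
    return counts


def enrichment(interactions: List[Tuple[str, str]],
               domains: Dict[str, Set[str]]) -> Dict[DomainPair, float]:
    """log2 of (frequency among interactions) / (frequency among all pairs).
    Assumes no self-interactions and that every protein has a domain entry."""
    all_pairs = list(combinations(sorted(domains), 2))
    observed = domain_pair_counts(interactions, domains)
    background = domain_pair_counts(all_pairs, domains)
    return {
        pair: math.log2((observed[pair] / len(interactions))
                        / (background[pair] / len(all_pairs)))
        for pair in observed
    }


# Hypothetical domain compositions and interactions.
toy_domains = {
    "P1": {"SH3"}, "P2": {"PxxP"}, "P3": {"SH3", "Kin"},
    "P4": {"PxxP"}, "P5": {"Kin"},
}
toy_ppi = [("P1", "P2"), ("P3", "P4"), ("P1", "P4")]

scores = enrichment(toy_ppi, toy_domains)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

The pair that recurs in every interaction receives a positive score, whereas a pair that appears only because it co-occurs in a multidomain protein scores below zero, which is the signal such methods aim to separate.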

Blind evaluations

The well established Critical Assessment of Techniques for Protein Structure Prediction (CASP) project [1, 77] (see Box 1) has set up an additional function-prediction category. This is inherently different from the CASP structure-prediction categories, because at the end of the experiment the function of the target protein is likely to remain unknown. Nevertheless, the community concurred that the effort was justified because of its importance. The results of the first run were rather disappointing [78]: only a few groups participated in the challenge; 3D-structure predictions were rarely used for function prediction; and the assessment procedure was too complex. Notably, however, the function predictions submitted by the different groups often agreed, and a 'consensus prediction' could be derived. A reassessment of the experiment after more experimental evidence had accumulated [79] revealed that a consensus prediction could reach as high as 80% accuracy, although the sample was too small to substantiate the significance of this finding.

CASP has fostered the development of many other experiments, such as AFP [80], BioCreative [81], GeneFun [82] and MouseFunc [83], that exploit a range of predictive methods to make functional annotations. Most computational methods participating in the AFP experiment to identify ligand-binding sites relied on the use of sequence and structural information from related proteins, and fell into the broad categories described above [80]. Only rarely did predictors attempt to identify ligand-binding sites de novo. In the second edition of the AFP experiment, in 2007, some novel ideas were explored, such as the potential contribution of protein disorder, and a systems-level analysis of pathways.

Regrettably, the lack of independent test sets for testing blind predictions prevents proper assessment of the strengths and weaknesses of individual methods. This general issue should concern the whole biological community. It will be difficult to improve function-prediction methods without reliable test sets, and a good way to overcome this problem has yet to be found.

In summary, function prediction remains a challenge. Ab initio prediction (that is, not based on annotation transfer) usually provides very limited, if any, clues. Evolutionary relationships, though complicated by the ortholog/paralog dichotomy, are by far the strongest predictors and the next few years will see increasingly sophisticated methods for deciphering their functional meaning. Molecular biology is moving to a more holistic view of biological processes and this requires better integration of different types of data. Elucidating function needs the combination of information from genomes, sequences, transcription patterns and genetic variation, as well as the results of prediction algorithms. Community approaches will ultimately empower discovery-oriented biology and, in turn, improve its translation to medicine and the environment.