Structural genomics is the largest contributor of novel structural leverage
- Cite this article as:
- Nair, R., Liu, J., Soong, T. et al. J Struct Funct Genomics (2009) 10: 181. doi:10.1007/s10969-008-9055-6
The Protein Structure Initiative (PSI) at the US National Institutes of Health (NIH) is funding four large-scale centers for structural genomics (SG). These centers systematically target many large families without structural coverage, as well as very large families with inadequate structural coverage. Here, we report a few simple metrics that demonstrate how successfully these efforts optimize structural coverage: while PSI-2 (2005-now) contributed more than 8% of all structures deposited into the PDB, it contributed over 20% of all novel structures (i.e. structures for protein sequences with no structural representative in the PDB on the date of deposition). The structural coverage of the protein universe represented by today’s UniProt (v12.8) has increased linearly from 1992 to 2008; structural genomics has contributed significantly to the maintenance of this growth rate. Success in increasing novel leverage (defined in Liu et al. in Nat Biotechnol 25:849–851, 2007) has resulted from systematic targeting of large families. PSI’s per-structure contribution to novel leverage was over 4-fold higher than that of non-PSI structural biology efforts during the past 8 years. If the success of the PSI continues, it may take just another ~15 years to cover most sequences in the current UniProt database.
Keywords: Protein structure determination; Structural genomics; Evolution; Protein universe
Abbreviations
- 3D structure: Here used exclusively to refer to the three-dimensional coordinates of each atom in the native conformation of a protein
- JCSG: Joint Center for Structural Genomics
- MCSG: Midwest Center for Structural Genomics
- NESG: Northeast Structural Genomics Consortium
- PDB: Protein Data Bank of experimentally determined 3D structures of proteins
- PSI: Protein Structure Initiative at the NIH-NIGMS
- NYSGXRC: New York Structural GenomiX Research Consortium
- UniProt: Unification of SWISS-PROT, TrEMBL and PIR protein sequence database
Systematic targeting of the largest families without structural coverage
The US contribution to Structural Genomics (SG), the Protein Structure Initiative (PSI), is funded by the National Institutes of Health-National Institute of General Medical Sciences (NIH-NIGMS). The second 5-year phase of the initiative, PSI-2, began in 2005. Four large-scale Structural Genomics Centers were created for high-throughput production of protein structures (JCSG, MCSG, NESG, NYSGXRC), together with six Specialized Research Centers; both groups are charged with continuing to develop the technologies needed for large-scale protein structure determination. The four large-scale production centers are currently poised to generate over 3,000 entirely new experimental 3D structures of proteins for the biomedical research community, in addition to the over 1,300 structures that originated from the pilot phase. At the end of the first 3 of those 5 years, the four centers had already deposited almost 2,000 new 3D structures (data from TargetDB).
Through the development and advancement of biochemical, robotic, NMR, crystallographic and computational techniques, SG centers are decreasing the cost and time required to determine a protein structure in order to advance the structural coverage of sequence space and biomedical research. The development and advancement of high-throughput protein production and protein structure determination pipelines are critical to the eventual characterization of protein structure space, to expanding our understanding of molecular evolution, and to addressing biomedical problems such as drug discovery.
These objectives pose two main challenges for computational biology: (1) identify targets for which each experimental structure will have a high leverage for modeling and (2) focus on those targets that will likely yield structures using current high-throughput (HTP) methods [14, 21, 23, 37].
Metrics of success
Several metrics of success have been developed to monitor the evolution of structural genomics during PSI [8, 18, 22]. These include (i) total numbers of PDB depositions, (ii) numbers of distinct sequences (<98% pairwise sequence identity) for which an experimental structure is determined, (iii) numbers of ‘novel structures’, defined as structures of proteins having <30% sequence identity with any protein structure already in the PDB, (iv) first 3D structure from a particular domain family, (v) first 3D structure from a particular functional class of proteins, (vi) protein structures which provide a novel testable hypothesis about function, and other metrics. In the following paragraphs we outline some of these metrics, which relate to the value of experimental 3D structures in providing useful structural information about homologous protein sequences.
Modeling leverage of experimental structures
Homologous proteins from different organisms, defined as those that have evolved from a common recent ancestral protein, usually share similar 3D structures [10, 28, 31, 35]. Therefore, the PSI does not aim at producing structures of every protein from every organism. Instead, the PSI aims to identify structural domains in proteins, systematically organize these protein domains into sequence-structure families, and determine the 3D structure of one or a few representatives from many of these families. The ultimate goal is to attain structural coverage for every major protein domain family found in nature.
Almost 50,000 experimentally determined 3D structures have been deposited into the PDB. However, this accounts for less than 1% of the ~6 million protein sequences deposited into UniProt. As genome sequencing technologies advance, sequence data are being generated at an ever increasing pace, not only for complete genomes of organisms but even for entire ecologies of hundreds or thousands of microorganisms (metagenomics) [12, 36, 39]. Accordingly, the rate of discovery of new protein sequences will continue to increase much faster than the rate of protein structure determination.
The fact that homologous protein domains have similar structures enables the application of homology, or comparative, modeling methods [17, 32]. Comparative modeling leverages the information provided by each experimental structure many fold. For example, it has been proposed that experimental determination of 3D structures for one representative of each of the largest 1,000–2,000 protein domain families would be sufficient to allow modeling, at some approximate level, of more than half of all the residues in all of UniProt [21, 38].
The “modeling leverage” of a particular 3D structure (modeling template) depends on several factors, including (i) the sequence similarity between the template with known experimental structure and target proteins of unknown structure, (ii) the method of comparative modeling, and (iii) the criteria by which a model is judged to be “useful”. The third factor (what is good enough?) can be especially difficult to ascertain: rather inaccurate models (e.g. just the overall fold) are sufficient for some important applications, while other applications may require very high accuracy models. Benchmark studies suggest that sequence similarity of >40% over >50 residues generally provides models with heavy atom root-mean-square deviations of <2.5 Å from the true experimental structure [6, 11, 16, 24–27]. However, this is only a general trend: templates that are less sequence similar to the target may occasionally provide even more accurate models, and templates that are more sequence similar may result in less accurate ones. Leverage also must be defined with respect to what portion of the target protein can be modeled from the experimental template, leading to metrics for full protein models, protein domain models, or residue models per experimental template. Modeling leverage also needs to be defined with respect to a particular sequence database, e.g. with respect to a particular version of UniProt.
The concept of modeling leverage is intimately associated with the concept of structural coverage, i.e. the number or percentage of a particular set of protein sequences, domains, or residues which can be modeled from a particular set of experimental protein templates. Structural coverage of the protein universe (i.e. a particular version of UniProt), of an entire proteome of an organism (e.g. the human proteome), of an ecology of organisms (e.g. all human gut microorganisms), or of a system of co-functioning proteins (e.g. proteins associated with a particular biological process) are all key metrics in measuring the success of SG that depend on the definition of modeling leverage.
Novel modeling leverage and novel coverage
Related to the concept of modeling leverage is the concept of novel modeling leverage, operationally defined as the number of proteins/domains/residues that could not be modeled (based on the above specific definition of leverage) as of the date the subject experimental structure was deposited into the public PDB. The novel leverage provided by a set of experimental 3D structures across a particular set of protein sequences defines the novel coverage provided by these structures. This concept of leveraging experimental structures, and particularly novel leverage, has been fundamental to the process of target selection by large-scale centers during PSI-2. In particular, the large-scale centers systematically target the largest protein domain families for which we currently have little or no structural coverage.
The need for a standard convention
The modeling leverage value of a particular experimental structure, or the coverage of a set of sequences by a set of structures, depends upon the details of the thresholds defined for the sequence similarity that can be expected to provide a “useful” model, as outlined above. There are also certain technical issues which may or may not be accounted for in any method of assessing novel leverage. Examples of such issues, not considered in the current work, include: (i) while a sequence may be modeled from a structure already in the PDB on the date of deposition of the subject structure, the subject structure may allow more accurate modeling of this sequence, and (ii) one may or may not discount the novel modeling leverage of a particular structure by the modeling leverage of experimentally determined structures subsequently deposited in the PDB. It is simply not possible to define universal thresholds or criteria of model accuracy that are appropriate for the full range of applications for which models are used. Thus, the novel leverage reported for the same data by different groups may vary widely. Here, we adopt as a convention the definitions and thresholds proposed by Liu et al. for assessing modeling leverage, novel modeling leverage, and the corresponding metrics of novel coverage. This is a convenient measure of “modelability” that is easily reproducible with relatively modest computing resources (the analysis presented here consumed less than 2 CPU-years).
All data about the status of structural genomics targets were taken from TargetDB. Leverage, novel leverage, and the corresponding metrics for coverage were determined by the method of Liu et al. The basic concept is the following. We begin with a fixed version of UniProt, in this case release 12.8 from Feb 2008, containing 5,678,599 protein sequences with 1,851,231,082 residues. For this version we compile the number of proteins and residues that align (PSI-BLAST E-value < 10⁻¹⁰, 3 iterations on UniProt, one on PDB with background estimates based on UniProt size; for more details see Liu et al.) to any protein of experimentally determined 3D structure deposited into the PDB at a given time point T = T0. Novel leverage is then everything that is not covered by this simple alignment protocol at T0 and that becomes covered through structures added to the PDB at T1 > T0; total leverage is computed over all structures in the PDB using the same criterion.
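The bookkeeping behind this protocol can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the pipeline of Liu et al.: the hit table, the sequence identifiers and the function names are hypothetical, and the PSI-BLAST alignments that pass the E-value threshold are assumed to have been computed beforehand.

```python
from datetime import date

# Hypothetical toy data: for each UniProt sequence, the deposition dates of all
# PDB structures it aligns to below the E-value threshold (alignments are
# assumed to be precomputed; identifiers are made up for illustration).
hits = {
    "seqA": [date(1998, 5, 1), date(2006, 3, 9)],
    "seqB": [date(2006, 3, 9)],
    "seqC": [date(2007, 11, 20)],
    "seqD": [],  # no structural template at all
}


def covered(hits, t):
    """Sequences with at least one template deposited on or before time t."""
    return {seq for seq, dates in hits.items() if any(d <= t for d in dates)}


def novel_leverage(hits, t0, t1):
    """Sequences first covered by structures deposited after t0 and up to t1."""
    return covered(hits, t1) - covered(hits, t0)


t0, t1 = date(2000, 1, 1), date(2008, 2, 1)
total = covered(hits, t1)                        # total leverage (a count of sequences)
novel = novel_leverage(hits, t0, t1)             # novel leverage gained between t0 and t1
coverage_pct = 100.0 * len(total) / len(hits)    # coverage (a percentage)
print(sorted(total), sorted(novel), coverage_pct)
```

In this toy example seqA was already covered before T0, so only seqB and seqC count as novel leverage, even though seqA also aligns to a structure deposited later.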
We loosely referred to an experimental structure (more precisely, the structure specified by a particular PDB identifier) as a novel structure if at least 50 residues of this structure could be used to create novel leverage. This implies in particular that novelty was not at all constrained by the structural similarity of the new coordinate set to any other structure already in the PDB. When compiling per-residue estimates for novel leverage, we did not apply any such threshold; instead, any single residue that could not have been modeled before counted.
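The difference between the per-structure threshold and the per-residue count can be illustrated with a short Python sketch. The target protein, the residue ranges and the function name are hypothetical, and the sketch assumes a gap-free alignment so that each aligned target residue corresponds to one residue of the new structure.

```python
MIN_NOVEL_RESIDUES = 50  # per-structure threshold stated in the text


def novel_residues(new_alignment, previously_covered):
    """Target positions aligned to the new structure that no earlier template covered."""
    return set(new_alignment) - set(previously_covered)


# Hypothetical target of 200 residues: positions 1-120 were already modelable,
# and the new structure aligns to positions 90-180 (gap-free, for simplicity).
previously_covered = range(1, 121)
new_alignment = range(90, 181)

novel = novel_residues(new_alignment, previously_covered)
per_residue_leverage = len(novel)                        # every residue counts, no threshold
is_novel_structure = len(novel) >= MIN_NOVEL_RESIDUES    # threshold applies per structure
print(per_residue_leverage, is_novel_structure)  # 60 True
```

Had the new structure added only 30 previously uncovered residues, those 30 residues would still enter the per-residue tally, but the structure itself would not be counted as novel.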
Novel leverage versus novel coverage
In the context of this work, we used DB = UniProt 12.8 (Feb. 2008) and E0 = PSI-BLAST E-value < 10⁻¹⁰. Coverage often is compiled with respect to the same database as leverage, i.e. DS = DB. In fact, this is the metric that we compiled for this work. However, we have also compiled coverage values for the set of proteins in particular organisms, e.g. focusing on the structural coverage for the human proteome. In principle, leverage and coverage are symmetric: both can be compiled on the same data set, and the only essential difference is that one counts numbers, the other percentages.
Both leverage and coverage can be computed on a per-structure, a per-residue or a per-annum basis. Frequently, we also compiled those numbers as sums over all PSI structures and compared them with the sums over all PDB structures and/or over all PDB structures excluding the PSI structures.
Novel leverage is compiled with the same choices as above: DB = UniProt 12.8, and E0 = PSI-BLAST E-value < 10⁻¹⁰. The deposition date in the PDB entry decides whether or not a structure is novel. One important and desired consequence of this definition is the following. Assume you solved a structure that has high impact in the sense that many groups use it as a basis for molecular replacement to determine more accurate structures of the same or of a similar protein sequence. Then the first structure in this family of structures is recognized for the novel information it provided on the date it was deposited in the PDB. The problem that remains, and that we have not yet addressed convincingly, is how to measure the benefit of a structure that allows better models to be built for proteins for which we can already build models. As indicated by Eq. 2A, only sequences that match the sequence of the template at the minimal threshold (E-value < 10⁻¹⁰) count.
Results and discussion
Every other novel structure from the USA now from the PSI
Table 1 Structural coverage of UniProt
Columns: number of proteins (in thousands); number of residues (in millions); PSI novel leverage, proteins (thousands); SG novel leverage, proteins (thousands); PSI novel leverage, residues (millions); SG novel leverage, residues (millions)
Rows: 3D coverage 2000; new 3D 2000–2008; new 3D 2005–2008
Structural coverage of sequence space continues to increase
The PSI contribution to the coverage added by US structures now exceeds the 50% mark, i.e. PSI-2 contributes more novel leverage, and hence more coverage, than all other US efforts (Fig. 1b). With this increase, the US contribution to the novel leverage worldwide continues to increase (Fig. 1b). Interestingly, non-PSI SG, whose contribution peaked in 2004–2007, has added relatively little to the worldwide annual novel leverage, while the novel leverage contributed by non-US, non-SG groups has been relatively constant at ~40% annually.
Overall, the structural coverage of UniProt 12.8 increased slowly up until about 1992 (Fig. 2a). After that, structural coverage increased at a roughly constant annual rate. The growth slowed down slightly toward the onset of structural genomics because, despite the continued annual increase in the number of structures determined, it is getting increasingly difficult to succeed for proteins that have so far eluded structure determination. Novel leverage becomes an increasingly elusive objective. The advent of structural genomics countered this development and returned the growth in structural coverage to almost constant annual rates. During the course of PSI, the overall structural coverage for UniProt 12.8 has approximately doubled (Table 1) from ~22 to ~45% per-protein coverage (Fig. 2a).
If we reset the coverage clock to zero at the beginning of PSI, and compute the gain over the structural coverage in a given year (Fig. 2b), we note that between 2000 and 2008 the per-protein structural coverage of UniProt 12.8 increased by about 26 percentage points (Fig. 2b: sum over all contributions; Table 1: 1,485/5,679 K) corresponding, by 2008, to an overall per-protein coverage around 45%. Some 22% of the increase in per-residue and per-protein structural coverage provided by all structures deposited worldwide came essentially from four PSI large-scale centers, compared with ~34 and ~40% increase in the structural coverage of UniProt 12.8 by all non-PSI US and all non-SG, non-US groups, respectively, in the same time frame (Fig. 2b). Note that the precise values here depend crucially on the parameters chosen. Our restriction to E-values ≤ 10⁻¹⁰ implies that the inferred structural models are of relatively high reliability and cover most of the aligned regions [16, 25]; higher leverage and coverage can be achieved at the expense of accuracy [3, 17, 27, 33].
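The per-protein gain quoted above follows directly from the two counts cited from Table 1. The snippet below is only an arithmetic check; the variable names are illustrative.

```python
# Counts quoted in the text (Table 1), in thousands of proteins.
newly_covered_k = 1485   # proteins that gained structural coverage, 2000-2008
uniprot_total_k = 5679   # proteins in UniProt 12.8

gain_points = 100.0 * newly_covered_k / uniprot_total_k
print(f"{gain_points:.1f} percentage points")  # 26.1 percentage points
```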
PSI per-protein gain in novel leverage is 3–4 fold higher than PDB without PSI
PSI has by now targeted and worked on most of the largest 16,787 sequence-structure families with prokaryotic representatives. PSI-2 continues to pick the largest remaining families; however, those become smaller. The novel leverage of all non-PSI structures in the PDB is also decreasing. This is partly due to the same reason: the largest families are either structurally covered or continue to evade structure determination. Furthermore, as already discussed, the generation of novel leverage becomes increasingly challenging.
Does this imply that attempts at experimentally determining structures for new sequence-structure families will be doomed? Despite efforts in optimizing novel leverage and providing structures for as yet uncharacterized domain sequence families, structural genomics has not discovered many truly novel structures [1, 5, 7, 8, 13, 15, 18]. Indeed, the discovery of previously unobserved protein structure space (new geometries and principles not seen before) is becoming increasingly difficult. This implies that (i) we now know most protein structure geometries or folds and (ii) on average, staying within the vicinity of known structures is more likely to result in a successful structure determination. By design, PSI-2 has attempted, and succeeded, in targeting proteins that are not similar to proteins of known structure, i.e. in increasing the odds of discovering new territory through the development of high-throughput pipelines and technologies. To rephrase this in a common analogy: by focusing on protein domain families with no structural representatives, PSI-2 has systematically targeted and succeeded in reaching “higher-hanging fruits”.
Many other criteria for success
Structural genomics, by design, is a hypothesis-generating rather than a hypothesis-driven endeavor. It shares this aspect with many new high-throughput genomics projects in the evolving discipline of molecular biology although, unlike other genomics projects, structural genomics continues to generate very high-resolution, detailed molecular data. The success of the PSI is reflected in many aspects, which range from increasing the speed of structure determination and deposition (both dramatically increased during this decade), through high literature impact and extreme reduction in the number of papers per structure, to the push for automation and robotics, which increases the diverse biophysical measurements readily available to researchers in related fields with different expertise.
Objective criteria that allow the monitoring of the degree to which scientific endeavors deliver what they promised are naturally becoming integral parts of a landscape in which the funding for science shrinks, while the challenges for the scientist arguably increase, and in which an increasing fraction of all science is funded by time-limited grants. Here, we have demonstrated that PSI-2 has been extremely successful when measured against the aims it set at the start: it contributed substantially toward the increase of novel leverage, to the extent that a future without PSI will clearly imply a considerable lengthening of the time needed to cover today’s protein universe.
Given that the PSI was successful in meeting the milestones that the PSI commission posed, the aim now is to finish with a wider perspective that considers the optimization of structural coverage as a means and not as an end. One aspect of structural genomics is the adventure of mapping unknown spaces. We seek connections to create maps. These objectives require the coincidence of a wealth of sequences and structures in spaces that have hardly been experimentally covered (i.e. families of unknown function) but appear to be extremely important, as demonstrated by the annotations for the universal family of EVE/PUA/PUA-like proteins enriched by structural genomics. All these connections contribute to the understanding of protein evolution. The PSI has covered an immense fraction of the prokaryotic sequence-space in terms of generating protocols, reagents, and experimental data. This wealth is available today through the PSI Materials Repository (http://www.hip.harvard.edu/gateway/) and through the PSI Knowledge Base (http://kb.psi-structuralgenomics.org/). A relatively small fraction of the target families have so far yielded experimental structures, but this “small” fraction now contributes over one-third of the novel leverage worldwide, providing structural templates for over 300,000 new reliable protein structure models.
Another long-term impact is the contribution toward making structure an integral part of molecular biology and toward converting structure determination from an amazing art mastered by few into a pipeline accessible to many. Clearly, cost reduction and the development of sophisticated semi-automated high-throughput pipelines contributed immensely to making this happen. Without structural genomics, today’s level of automation would not have been reached at all. Similarly, the development of cheaper sequencing techniques was certainly no goal of the human sequencing project, yet those techniques have been changing biology immensely over the last decade.
16–20 years to go to complete coverage of sequence universe?
How much more is left to do? The following rationale provides an oversimplified answer. Firstly, we have estimated that at least 20% of all residues in proteomes are not viable targets for structural genomics because they encode complex integral membrane proteins, long continuous coiled-coil regions, long regions that are natively unstructured, and leftovers from partial models (e.g. model A covers domain D1 from residues 6–55, model B covers domain D2 from residues 61–100 in a protein of 100 residues; this leaves the 10 residues 1–5 and 56–60 as non-viable targets). Most of these 20% of the residues are in short regions not assigned to a particular domain and are probably some sort of domain linkers and embellishments. Put differently, 80% per-residue coverage implies “completion”. Secondly, today’s coverage is about 40%, i.e. 40 percentage points (80 − 40) remain to be done. Thirdly, extrapolating from Fig. 2a, we might estimate the average annual per-residue growth in coverage of UniProt 12.8 to be about 2.5 percentage points. Assuming this rate to hold for the future, we would estimate 16 years (40/2.5 = 16) to structurally cover whatever remains of the UniProt 12.8 sequence database. While sequence space continues to grow, much of this new growth maps to domain families covered by this 80% of the current protein sequence universe.
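Both pieces of arithmetic in this rationale, the leftover residues of the partial-model example and the 16-year extrapolation, can be reproduced with a short Python sketch. The function and variable names are illustrative; the numbers are those quoted in the text.

```python
def uncovered_positions(length, models):
    """1-based positions not covered by any model, given inclusive (start, end) ranges."""
    covered = set()
    for start, end in models:
        covered.update(range(start, end + 1))
    return [p for p in range(1, length + 1) if p not in covered]


# Example from the text: a 100-residue protein with models for 6-55 and 61-100.
leftover = uncovered_positions(100, [(6, 55), (61, 100)])
print(len(leftover))  # 10 (positions 1-5 and 56-60)

# Extrapolation with the figures quoted in the text (per-residue percent).
completion, current, annual_growth = 80.0, 40.0, 2.5
years_left = (completion - current) / annual_growth
print(years_left)  # 16.0
```

The estimate is linear in both the assumed completion level and the assumed growth rate, so a slower annual gain lengthens the horizon proportionally.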
Clearly, the assumption of identical growth is overly optimistic: the rate has been kept at a linear growth only due to the focused effort of structural genomics. Given that PSI-2 has already cloned almost all the largest viable families, it is clear that the future leverage will be lower. Moreover, as new genomes are sequenced, only a fraction of these sequences map to known protein domain families, and the uncovered protein universe continues to grow.
Furthermore, it might be argued that the 40% of the residues that remain to be structurally explored will constitute proteins that are much more challenging for structure determination than those in the 40% of the residues that are covered today. If so, structural genomics methods might fail to capture those residues in these much more challenging classes of proteins, and our assumption of a constant growth rate might be inappropriate. True, this might be so, and we have no scientific argument to dispel this concern. However, we can move back into the past and pretend to estimate for what was then the future: e.g. if we had taken the growth rate from 1994 to 2000 to estimate the coverage of 2008, we would have been completely right (Fig. 2a).
Where from here?
We have established structural genomics as an extremely efficient way to discover new areas in the protein universe that will undoubtedly continue to evoke testable hypotheses for years to come. Will the trend continue? Can we extrapolate from today’s data, or will we need something completely different to efficiently cover what remains? Clearly, we have to improve structure determination for sequence-structure families from eukaryotes. Today, it requires some 5–10-fold more resources to determine the structure of an average eukaryotic protein than of an average prokaryotic protein. A considerable fraction of the untouched sequence space falls into sequence-structure families that exclusively represent eukaryotes. Clearly, targeting this important domain becomes an important objective. Another observation from PSI-2 is that structure determination has so far succeeded for less than 30% of all families targeted. Developing techniques that allow a substantial increase in this yield appears to be another important goal.
The final question is how much the part of the universe without structural coverage will differ from the part we cover today. Clearly, we need to find ways to make structural genomics work for types of proteins for which it has so far had only limited success, including membrane proteins, eukaryotic proteins, and secreted proteins. Are there any new structural principles out there that remain to be discovered and that totally elude today’s techniques for structure determination? Biology is so full of innovation and surprise that the answer will clearly be in the affirmative. To what extent this will be the case remains utter speculation. However, we have strong evidence that a considerable part of what is left falls into the category of proteins that are unusually flexible, or intrinsically unstructured, and that possibly do not adopt regular structures without a binding partner. Do we therefore have to step up in terms of complexity and attack the problem of a structural genomics for complexes? Clearly, this will be one of the important challenges for both the short-term and the long-term future of structural genomics.
Thanks to Marco Punta (Columbia) for critical comments on the manuscript and work; to Paul Glick and Guy Yachdav (Columbia) for computer assistance; and to all those who contribute to making structural genomics an exciting scientific breakthrough. This work was supported primarily by the grant U54-GM074958-01 to the Northeast Structural Genomics consortium (NESG). PSI would not be possible without the dedication and continued support of grant agencies and in particular of those administering grants. We would especially like to acknowledge many at NIGMS, in particular Jeremy Berg, John Norvell, Charles Edmonton, and Jerry Li. Last, not least, thanks to Amos Bairoch (SIB, Geneva), Rolf Apweiler (EBI, Hinxton), Phil Bourne (San Diego Univ.), John Westbrook (Rutgers), Helen Berman (Rutgers), and their crews for maintaining excellent databases and to all experimentalists who enabled this analysis by making their data publicly available.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.