The Protein Model Portal
Structural genomics has been successful in determining the structures of many unique proteins in a high-throughput manner. Still, the number of known protein sequences is much larger than the number of experimentally solved protein structures. Homology (or comparative) modeling methods make use of experimental protein structures to build models for evolutionarily related proteins. Experimental structure determination efforts and homology modeling thereby complement each other in the exploration of protein structure space. One of the challenges in using model information effectively has been that the models available for a specific protein are held at different sites, in heterogeneous formats, under various incompatible accession code systems. Often, structure models for hundreds of proteins can be derived from a given experimentally determined structure, using a variety of established methods. This has been done by all of the PSI centers, and by various independent modeling groups. The goal of the Protein Model Portal (PMP) is to provide a single portal which gives access to the various models that can be leveraged from PSI targets and other experimental protein structures. A single interface allows all existing pre-computed models across these various sites to be queried simultaneously, and provides links to interactive services for template selection, target-template alignment, model building, and quality assessment. The current release of the portal consists of 7.6 million model structures provided by different partner resources (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM, ModBase, SWISS-MODEL Repository). The PMP is available at http://www.proteinmodelportal.org and from the PSI Structural Genomics Knowledgebase.
Keywords: Protein model portal; PSI structural genomics knowledgebase; Comparative protein structure modeling; Homology modeling; Model database
- CSMP: Center for structures of membrane proteins
- JCMM: Joint center for molecular modeling
- JCSG: Joint center for structural genomics
- MCSG: Midwest center for structural genomics
- NESG: Northeast structural genomics consortium
- NYSGXRC: New York SGX research center for structural genomics
- PDB: Protein Data Bank
- PMP: Protein Model Portal
- PSI SGKB: PSI structural genomics knowledgebase
- PSI: Protein structure initiative
- SIB: Swiss Institute of Bioinformatics
Structural genomics has been successful in determining the structures of many unique proteins in a high-throughput manner. Since 2001, these efforts have resulted in more than 6,772 structure depositions to the PDB, 3,251 of which are from the NIH-sponsored Protein Structure Initiative (PSI) Centers. Nevertheless, the number of known protein sequences is much larger than the number of experimentally solved protein structures, and the number of sequences is growing at an unprecedented rate due to large-scale genome sequencing and meta-genomics projects. Homology (or comparative) modeling methods make use of experimental protein structures to build models for evolutionarily related proteins. This technique predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). For every experimentally determined structure, models for hundreds of proteins can often be derived using a variety of established comparative protein structure modeling methods, which dramatically increases the structural coverage of protein sequences. Structural genomics and homology modeling thereby complement each other in the exploration of protein structure space.
The quality of a comparative protein structure model depends on the evolutionary distance between the sequence of the target protein to be modeled and the template structure. In general, comparative models sharing more than 40% sequence identity with the template are considered high-quality models. Comparative protein structure models are routinely used in a wide range of biomedical applications such as the rational design of mutagenesis experiments, the interpretation of disease-related mutations, or structure-based virtual screening studies [4, 5, 6]. However, despite the tremendous growth in protein structure data in recent years, structural information is often not used to its full extent in biomedical research projects. One reason is that experimental protein structures are only available for a small subset of all known protein sequences. A significant impediment to using 3D models effectively is that model information on a specific protein is distributed over different web sites in heterogeneous formats, using incompatible accession code systems. We have developed the Protein Model Portal (PMP, http://www.proteinmodelportal.org/) as part of the PSI Structural Genomics Knowledgebase (http://kb.psi-structuralgenomics.org/KB/) in order to provide a single portal which gives access to the various models that can be leveraged from PSI targets and other experimental protein structures [7, 8]. The Portal has been presented at the NIGMS PSI Bottlenecks Workshop. Here, we describe the challenges that exist for building such a model portal, present the technical implementation, show specific examples for accessing the portal, briefly report on the community workshop held on “Applications of Protein Models in Biomedical Research”, and discuss future developments of the project.
Various algorithmic approaches [10, 11, 12, 13, 14, 15, 16] with different strengths and weaknesses have been developed for building three-dimensional models of proteins. In addition, the quality of individual models depends strongly on the evolutionary proximity to the selected structural templates. Finally, the update frequency of different services varies. For these reasons, a consensus view of the results obtained from different modeling resources is very often helpful in identifying the most reliable solution.
The content of the PDB database is organized around experimental information (structure centric), whereas model information is based on sequence databases (sequence centric); difficulties arise due to the fact that sequence database content and accession codes are often transient and frequently incompatible between different databases.
Models often cover only fragments of a protein sequence. The different segments modeled for a given target protein are usually based on various alternative alignments, alternative templates, or result from diverse modeling procedures.
Models are not stable in themselves, but reflect the state of the sequence and structure databases at the time of modeling. Models have to be revised as new, more suitable template structures become available. Changes in the primary sequence of target proteins in the sequence databases also necessitate remodeling.
Models typically have few intrinsic annotations (in comparison to UniProt or PDB entries) beyond the alignment to the template structure. Changes in the functional annotation of a protein database entry (e.g., UniProt, InterPro) occur independently of, and more frequently than, changes of the primary amino acid sequence; only the latter would require rebuilding the model.
Models are not experimental observations, but the results of theoretical predictions. While standards for describing the reliability and limitations of the most commonly used experimental structure determination techniques have been established, the spectrum of reliability and applicability of current modeling methods is broad and the level of uncertainty is significantly higher [19, 20]. Therefore, detailed information about each individual model is crucial for assessing its expected accuracy, and thereby determining its scope of applicability.
The number of currently available models is orders of magnitude larger than that of experimental structures (ca. 7 million vs. 50,000) and therefore poses technical challenges in efficient data handling.
Model information is complex: one needs to estimate the expected accuracy and reliability of a specific model, or of parts thereof, and users of models are often unsure to what extent a model can be used for a given application.
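The point that models often cover only fragments of a target sequence can be made concrete with a small calculation. The following is a minimal sketch, not the portal's own code: it assumes each model is reduced to a 1-based, inclusive residue range on the target, and the names `merge_segments` and `coverage_fraction` are hypothetical.

```python
from typing import List, Tuple

# Assumption: one model segment = (first_residue, last_residue), 1-based, inclusive.
Segment = Tuple[int, int]


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or directly adjacent residue ranges."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(segments: List[Segment], target_length: int) -> float:
    """Fraction of target residues covered by at least one model segment."""
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    covered = sum(end - start + 1 for start, end in merge_segments(segments))
    return covered / target_length
```

For a 400-residue target with models spanning residues 1-120, 100-180, and 300-350, the merged ranges are 1-180 and 300-350, so 231 of 400 residues are covered. The sketch deliberately ignores which template or alignment each segment came from, which is exactly the information a user would still need to judge the segments individually.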
To summarize, model information is heterogeneous, highly context dependent, dynamic, and consists of large volumes of data. Consequently, a community workshop held at Rutgers on archiving structural models of biological macromolecules recommended that models no longer be accepted as part of the PDB archive and that a portal for protein model information be established. Following this recommendation, and taking into account the challenges described above, we have developed the PMP as a web portal which federates resources from model providers, experimental protein structures (PDB), and functional annotation databases.
The aim of the PMP is to foster the effective usage of molecular model information in biomedical research by providing unified access independent of individual sequence nomenclature and accession code system and by supporting the development of data standards to facilitate exchange of information and algorithms. Furthermore, PMP aims to provide a forum for discussions between developers of modeling methods and applied biomedical researchers on best-practices, including methods for quality assessment, guidelines for the publication of theoretical models, and educational resources on usage of models for different biological applications.
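One way to obtain access that is independent of accession code systems is to key records by the amino acid sequence itself. The sketch below illustrates the idea with an MD5 digest of the normalized sequence. The acknowledgments mention md5-based REST queries on the UniProt server; that the portal keys its own records in this way is an assumption made here for illustration, and the function names and the database labels and accession codes in the usage note are hypothetical.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def sequence_key(sequence: str) -> str:
    """MD5 hex digest of a sequence after removing whitespace and upper-casing."""
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def group_by_sequence(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, List[Tuple[str, str]]]:
    """Group (database, accession, sequence) records that share a sequence.

    Returns a mapping from sequence key to the (database, accession) pairs
    that refer to that exact sequence.
    """
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for database, accession, sequence in records:
        groups[sequence_key(sequence)].append((database, accession))
    return dict(groups)
```

Given records such as `("db_a", "X0001", "MKTAYIAK")` and `("db_b", "Y-42", "mktayiak")`, both end up under the same key although their accession codes are unrelated. A digest only matches identical sequences, so a single residue change in a database entry yields a new key, which fits the observation that sequence changes require remodeling.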
Architecture of the protein model portal
The PMP is queried from the PSI structural genomics knowledgebase using web services based on SOAP (Simple Object Access Protocol) and REST (Representational State Transfer). Users can also access the Portal directly using the web-based graphical user interface implemented with PHP at http://www.proteinmodelportal.org/. Functional annotation for individual target proteins is retrieved in real time from the respective annotation providers using web services: information about individual domains is retrieved from the InterPro database using DAS, whereas sequence annotations from the UniProt database are retrieved using REST from the UniProt server (http://www.uniprot.org). The three-dimensional coordinates of the individual models are stored at the different model providers and retrieved by the portal in real time when required for the visualization of the model overview page. Preview images of the structural model, generated using Molscript, Render, and Raster3D, give a first quick impression of the protein model. The alignment between the target sequence and the template is inferred on the fly by structural superposition of the final model onto the template structure using MAMMOTH. The consistent use of portal technologies allows a set of heterogeneous resources to be federated into a single portal while ensuring consistency of the exchanged data.
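The federation pattern described above, one query fanned out to several independent providers and the answers collected in one place, can be sketched as follows. This is an illustration under stated assumptions rather than the portal's implementation: providers are modeled as plain Python callables standing in for web-service clients, and `query_all`, `ModelRecord`, and the 10-second default timeout are hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Assumption: a provider takes a query string and returns a list of model records.
ModelRecord = Dict[str, str]
Provider = Callable[[str], List[ModelRecord]]


def query_all(
    providers: Dict[str, Provider], query: str, timeout_s: float = 10.0
) -> Dict[str, List[ModelRecord]]:
    """Send one query to every provider concurrently and collect the results.

    A provider that raises or exceeds the timeout contributes an empty list,
    so one unavailable site does not break the combined answer.
    """
    results: Dict[str, List[ModelRecord]] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(providers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in providers.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=timeout_s)
            except Exception:
                results[name] = []
    return results
```

In a real deployment each callable would wrap a SOAP or REST request to one partner site. Returning an empty list on failure keeps the sketch short; a production portal would more likely report which providers did not respond so that users can tell "no models" apart from "site unavailable".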
Web access to PMP
Model portal content
A “Workshop on Applications of Protein Models in Biomedical Research” was held in San Francisco in July 2008, where protein structure modelers explored how models are used in biomedical research, and which requirements and challenges exist for the different applications. The workshop program involved a first day of 16 presentations on topics that ranged from the coverage of protein sequence-structure space to the uses of modeling in medicine. On the second day, open questions were addressed by four independent discussion groups. The participants discussed the state-of-the-art in applying molecular modeling to biomedical problems, requirements and challenges for various applications, as well as ways to strengthen the collaboration between the modeling and experimental communities. The PMP has been recognized as an important means for increasing the impact of molecular modeling on biology and medicine, and specific recommendations for the further development of the PMP were made. The outcome of the workshop will be made available for the community in the form of a white paper.
Conclusion and outlook
We have established a single portal that allows users to simultaneously and transparently perform searches against all model information made available by the participating sites, using various protein identifiers or an amino acid sequence as the query input. This is achieved by federating the distributed resources via web services. The next steps in the development of the PMP will include a common interface for submitting interactive modeling requests to different automated modeling services, tools for comparing and assessing the quality of protein structural models, and the possibility to map a wide variety of functional annotations to the protein models. The consistent use of web services will not only make it possible to coordinate information provided by other services within the portal, but also to complement sequence-based resources, such as genome browsers, with model-based information. The development of widely accepted standards for data exchange and of tools for quality assessment of models will be among the future challenges, which are expected to be addressed during a second PMP workshop in 2009.
We are grateful to Rainer Pöhlmann ([BC]2 Basel Computational Biology Center & Biozentrum University of Basel) for professional systems support, Eric Jain for the swift implementation of md5-based REST queries on the UniProt server, and Michael Künzli for the computation of the statistics of the PMP models. We would like to thank all teams at the different partner sites for the great collaboration on the PSI SGKB Protein Model Portal integration, especially Wendy Tao (RCSB-PDB), Andrej Sali and Ursula Pieper (UCSF/NYSGXRC), Christine Orengo and David Lee (MCSG and UCL), Adam Godzik (JCMM), and Diana Murray (NESG). We are indebted to Roland Dunbrack Jr. (FCCC/NMHRCM) for invaluable advice on the development of PMP. The PSI SGKB Protein Model Portal was supported by the National Institutes of Health (NIH) as a sub-grant with Fox Chase Cancer Center grant 3 P20 GM076222-02S1, and as a sub-grant with Rutgers University, under Prime Agreement Award Number: 3U54GM074958-04S2.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- 6. Tramontano A (2008) The biological applications of protein models. In: Schwede T, Peitsch MC (eds) Computational structural biology. World Scientific Publishing, Singapore
- 8. Berman HM et al (2008) The protein structure initiative structural genomics knowledgebase. Nucleic Acids Res. Database Issue (in press)
- 14. Schwede T et al (2008) Protein structure modeling. In: Schwede T, Peitsch MC (eds) Computational structural biology. World Scientific Publishing, Singapore
- 18. Mulder NJ, Apweiler R (2008) The InterPro database and tools for protein domain analysis. Curr Protoc Bioinformatics. Chapter 2:Unit 2.7
- 32. Pieper U et al (2008) MODBASE, a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. doi: 10.1093/nar/gkn791
- 33. Kiefer F et al (2008) The SWISS-MODEL Repository and associated resources. Nucleic Acids Res. doi: 10.1093/nar/gkn750