HOMCOS: an updated server to search and model complex 3D structures
The HOMCOS server (http://homcos.pdbj.org) has been updated for both searching and modeling the 3D complexes for all molecules in the PDB. As compared to the previous HOMCOS server, the current server targets all of the molecules in the PDB including proteins, nucleic acids, small compounds and metal ions. Their binding relationships are stored in the database. Five services are available for users. For the services “Modeling a Homo Protein Multimer” and “Modeling a Hetero Protein Multimer”, a user can input one or two proteins as the queries, while for the service “Protein-Compound Complex”, a user can input one chemical compound and one protein. The server searches for similar molecules by BLAST and KCOMBU. Based on each similar complex found, a simple sequence-replaced model is quickly generated by replacing the residue names and numbers with those of the query protein. A target compound is flexibly superimposed onto the template compound using the program fkcombu. If monomeric 3D structures are input as the query, then template-based docking can be performed. For the service “Searching Contact Molecules for a Query Protein”, a user inputs one protein sequence as the query, and then the server searches for its homologous proteins in the PDB and summarizes their contacting molecules as the predicted contacting molecules. The results are summarized in the “Summary Bars” or “Site Table” display. The latter shows the results as a one-site-one-row table, which is useful for annotating the effects of mutations. The service “Searching Contact Molecules for a Query Compound” is also available.
Keywords: Template-based modeling, Complex, BLAST, KCOMBU
Molecular interactions are the essence of molecular functions in all forms of life, and thus characterizing interacting molecular pairs and interaction sites is fundamental for molecular biology. Huge numbers of interacting protein–protein pairs and compound-protein pairs have been analyzed and stored in databases [1, 2]. The 3D structures of molecular complexes reveal the atomic details of molecular interactions, and thus provide important information for understanding the molecular mechanisms. The amount of 3D complex structural data in the PDB has increased rapidly; however, it is still much smaller than the number of reported interactions. To fill this gap, computer modeling approaches are frequently used to extend the data of 3D complex structures. Both template-based modeling and de novo modeling are used to model 3D complex structures. Usually, template-based modeling is performed first, because it has advantages in terms of accuracy and computation costs, if proper templates are available.
Various methods for template-based modeling have been proposed. They can be roughly classified into two categories: “complex threading” and “template-based docking”. Complex threading is modeling from two or more amino acid sequences, and these sequences are aligned (threaded) on the template structure. Szilagyi and Zhang further classified complex threading into two subcategories: “monomer threading and oligomer mapping” and “dimeric threading”. The “monomer threading and oligomer mapping” approach is a simple extension of the standard template-based modeling (homology modeling) of a monomeric structure. For each target input sequence, its template structures are searched by the sequence search methods [6, 7, 8, 9, 10, 11] or the monomeric threading methods [12, 13]. If a template structure found for one target sequence forms a complex with that for another target sequence in the biological units in the PDB, then they can be regarded as the template 3D structure of the target sequences. The fitness of the target sequences and the template structure is evaluated by the sequence similarities or the potential energies between protein chains [6, 7, 8, 9, 10]. In the “dimeric threading” approach, two target sequences are simultaneously aligned with two corresponding structures using the two-body protein–protein interfacial energies [13, 14, 15, 16].
The “template-based docking” method basically requires two or more monomeric 3D structures of the target proteins as inputs. For each target input structure, its similar structures are searched in the PDB. If a template structure found for one target structure forms a complex with that for another target structure in the biological units in the PDB, then the target structures are superimposed on the corresponding template structure to generate a complex 3D model of the target structures. This method is called “template-based docking” because the standard de novo docking program also requires two monomeric protein 3D structures. Modeled 3D structures can be used instead of experimentally obtained 3D structures. To enhance the coverage of the prediction, most methods employ the monomer 3D modeling calculation as their first step [17, 18, 19, 20, 21]. The searches for the template structure are performed using various methods, including sequence similarity search, global 3D structure similarity search [17, 18, 19, 20, 21], and local 3D structure similarity search among interfaces [22, 23]. Since 3D structural searches, and especially local 3D structural searches, can capture remote structural homologies and analogies, they can cover vast amounts of known interactions, although their accuracies are limited [24, 25].
Template-based modeling has also been applied to the modeling of small compound-protein complexes. For this approach, a target chemical compound is superimposed onto the template compound using the 3D structural alignment programs of chemical compounds. The standard chemical compound-protein docking calculations are often improved by using known compound-protein complexes as the templates [27, 28, 29, 30]. For most of these cases, the 3D structure of the target protein is assumed to be known, and only that of the target compound is predicted. Complex 3D models of both the compound and protein have also been built by the template-based modeling approach [31, 32].
Nucleotide-protein 3D complex structures can also be modeled using the template-based approach. To determine the target sites of DNA-binding proteins, the DNA sequences within protein-DNA complexes were replaced and the interaction energies were evaluated. Recently, template-based methods for modeling both DNA/RNA and proteins were proposed [34, 35, 36, 37].
Many web servers for modeling complex structures have been developed. Most of their targets are hetero protein dimers [7, 8, 10, 11, 12, 22, 23, 38], although a few servers are available for compound-protein and nucleotide-compound complexes. To annotate a protein’s function completely, information about all types of protein complexes is required, because proteins often bind many types of molecules, such as other proteins, small compounds, metals, and nucleotides, to perform their functions.
Our server HOMCOS (HOmology Modeling of Complex Structure; http://homcos.pdbj.org) keeps the name of the server established in 2008 (http://strcomp.protein.osaka-u.ac.jp/homcos). However, we have completely rebuilt the server from scratch, and the current HOMCOS server is very different from, and much more useful than, the previous one. The previous HOMCOS server only handled protein–protein 3D complexes. In contrast, the current server contains all of the molecules in the PDB, including proteins, nucleic acids, small compounds and metal ions. The new server is able to perform both the “complex threading” and “template-based docking” approaches, since it accepts both amino acid sequences and monomeric 3D structures as queries. For compound-protein modeling, the 3D structure of a chemical compound can be used as the query. HOMCOS employs the standard program BLAST for finding protein templates, and the program KCOMBU for finding compound templates [30, 40, 41]. Another new and useful service is the ability to search for contact molecules for a query protein. For a user-input query sequence, HOMCOS searches for its homologous proteins in the PDB, and summarizes their contacting molecules as the predicted contact molecules. The results can be summarized as a table for each site of the query protein. This is useful for annotating the functional effects of nsSNPs.
Materials and methods
Data architecture of the HOMCOS database system
The unitmol table describes the molecule in the asymmetric unit. Its primary key is a combination of pdb_id and asym_id. The asym_id is a new molecular identifier introduced in the mmCIF format, and is often written as a capital letter (‘A’, ‘B’, ‘C’…). The classical PDB format also has a chain identifier, which is uniquely assigned to each protein and nucleotide polymer. In contrast, the new asym_id is assigned to all of the molecules, except waters, in the asymmetric unit (Fig. 2b). Therefore, all of the molecules in the PDB, not only proteins, but also nucleotides, small compounds and metals, have the corresponding unitmol data. The 3D coordinates of atoms in the unitmol are separately stored in the server, using the classical PDB format file. Their DSSP files are also stored, to retain their secondary structures and solvent accessibilities. The amino acid sequences of the unitmol molecules are stored in a FASTA format file for the BLAST search. Note that the residue numbers of the stored PDB files are mmCIF’s label_seq_id, not auth_seq_id; therefore, the residue numbers of the PDB files are consistent with the residue numbers of the FASTA file sequences in our server. For the unitmol molecules with a three-letter code (comp_id), their SDF files are stored in the server for the chemical structure search. This set of SDF files does not include any multi-residue molecules, such as peptides and oligosaccharides.
To summarize the predictions performed by the HOMCOS server, the redundancies of the sequences must be reduced. We perform an all-vs-all blastp comparison among all of the amino acid sequences of the unitmol molecules, and execute single linkage clustering for them with sequence identity ≥95 % and coverage of the aligned region ≥80 %. The label for the cluster is registered in the attribute cluster95 in the table for each protein unitmol molecule.
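The single linkage clustering described above can be illustrated with a short sketch. It is not the server's actual code: the function name, the tuple layout of a blastp hit and the example identifiers are hypothetical, and only the two thresholds (identity ≥95 %, coverage ≥80 %) are taken from the text. Two molecules end up in the same cluster if they are connected by any chain of hits that pass both thresholds.

```python
from typing import Dict, Iterable, List, Tuple

# One pairwise hit: (id_a, id_b, percent identity, percent coverage).
# This layout is an assumption made for the sketch.
Hit = Tuple[str, str, float, float]


def single_linkage_clusters(
    ids: Iterable[str],
    hits: Iterable[Hit],
    min_identity: float = 95.0,
    min_coverage: float = 80.0,
) -> List[List[str]]:
    """Group ids so that any pair linked by a passing hit shares a cluster."""
    ids = list(ids)
    parent: Dict[str, str] = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster root, shortening the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, coverage in hits:
        if identity >= min_identity and coverage >= min_coverage:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two clusters

    clusters: Dict[str, List[str]] = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


# Made-up identifiers of the form pdb_id + "_" + asym_id.
example_ids = ["1abc_A", "1abc_B", "2def_A", "3ghi_A"]
example_hits = [
    ("1abc_A", "1abc_B", 100.0, 100.0),
    ("1abc_B", "2def_A", 96.0, 85.0),
    ("2def_A", "3ghi_A", 60.0, 90.0),  # identity too low, no link
]
example_clusters = single_linkage_clusters(example_ids, example_hits)
```

In this example 1abc_A and 2def_A share a cluster although no hit connects them directly, which is the defining property of single linkage.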
The asmblmol table describes the 3D structure of the molecule in the biological unit. The mmCIF file contains information about the biological unit provided by both the authors and the software. Some of the biological units are constructed by combining transformed unitmol molecules, as shown in Fig. 2c. In the mmCIF file, the oper_expression identifier is assigned for the operation (rotation and translation) required to construct the biological units. Most of the oper_expression identifiers are numbers (‘1’, ‘2’, ‘3’,…), although some virus entries have the special strings “PAU” and “XAU”, and some entries have a combination of two operations, such as “1-61” and “5-62” (PDB code: 1m4x). The primary key of asmblmol is the combination of three ids: pdb_id, asym_id and oper_expression. The XYZ coordinates of the asmblmol molecule are also separately stored in a classical PDB format file. By definition, the amino acid sequence and the 2D chemical structure of each asmblmol molecule are the same as those of the corresponding unitmol molecule.
The table assembly summarizes the information about an assembly (biological unit), which is composed of several asmblmol molecules. Its primary key is a combination of pdb_id and assembly_id. A list of the asmblmol molecules of the assembly is stored in two array variables, asym_ids and oper_expressions. Note that many PDB entries have more than one candidate for their biological unit. For example, the PDB entry 3gyr has eight biological unit candidates: two homo hexamers (assembly_id = 1 and 2) defined by the author, and six homo dimers (assembly_id = 3, 4, …, 8) defined by the programs PISA and PQS. Since there is no reliable standard for choosing one correct biological unit, the HOMCOS server contains all of the biological units described in the mmCIF file.
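The relationships among the unitmol, asmblmol and assembly records can be sketched as plain data classes. This is only an illustration of the keys named in the text (pdb_id, asym_id, oper_expression, assembly_id, asym_ids, oper_expressions); the class names, the helper methods and the entry "1xyz" are hypothetical, and the real tables hold more attributes than are shown here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class UnitMol:
    """A molecule in the asymmetric unit; key = (pdb_id, asym_id)."""
    pdb_id: str
    asym_id: str
    cluster95: str = ""  # cluster label, meaningful for proteins only

    @property
    def key(self) -> Tuple[str, str]:
        return (self.pdb_id, self.asym_id)


@dataclass(frozen=True)
class AsmblMol:
    """A molecule in a biological unit; key = (pdb_id, asym_id, oper_expression)."""
    pdb_id: str
    asym_id: str
    oper_expression: str

    @property
    def key(self) -> Tuple[str, str, str]:
        return (self.pdb_id, self.asym_id, self.oper_expression)

    @property
    def unitmol_key(self) -> Tuple[str, str]:
        # Sequence and 2D chemical structure are those of this unitmol.
        return (self.pdb_id, self.asym_id)


@dataclass
class Assembly:
    """A biological unit; key = (pdb_id, assembly_id)."""
    pdb_id: str
    assembly_id: str
    asym_ids: List[str]
    oper_expressions: List[str]

    def members(self) -> List[AsmblMol]:
        # The two arrays are read in parallel, one pair per asmblmol.
        return [
            AsmblMol(self.pdb_id, asym_id, oper)
            for asym_id, oper in zip(self.asym_ids, self.oper_expressions)
        ]


# Hypothetical homo dimer built from chain A with operations '1' and '2'.
example_assembly = Assembly("1xyz", "1", ["A", "A"], ["1", "2"])
example_members = example_assembly.members()
```

Both members of the example dimer map back to the same unitmol key, which reflects the statement that an asmblmol shares its sequence and chemical structure with its unitmol.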
The table contact contains contacting residues between two asmblmol molecules belonging to the same biological unit (assembly data). If the distance between any pair of heavy atoms of the two molecules is within 4 Å, then these two molecules are regarded as a contacting molecule pair and registered in the contact table. Its primary key is the combination of five keys: pdb_id, one asmblmol key (asym_id, oper_expression), and another asmblmol key (asym_id_con, oper_expression_con). The array variable seq_id_con contains the residue numbers (seq_id) of the residues in the asmblmol molecule (asym_id, oper_expression) that contact the other asmblmol molecule (asym_id_con, oper_expression_con).
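The 4 Å heavy-atom criterion can be written down as a brute-force sketch. The Atom tuple, the function name and the coordinates are hypothetical; treating "within 4 Å" as "≤ 4.0" and identifying hydrogens by the element symbols H and D are assumptions of this sketch. A production implementation would normally use a spatial grid rather than an all-pairs loop.

```python
import math
from typing import List, NamedTuple, Sequence


class Atom(NamedTuple):
    seq_id: int    # residue number (label_seq_id in the stored files)
    element: str   # element symbol, e.g. "C", "N", "H"
    x: float
    y: float
    z: float


HYDROGENS = {"H", "D"}  # assumption: everything else counts as a heavy atom


def contacting_residues(
    mol: Sequence[Atom], partner: Sequence[Atom], cutoff: float = 4.0
) -> List[int]:
    """Residue numbers of `mol` having a heavy atom within `cutoff` of `partner`."""
    heavy_partner = [b for b in partner if b.element not in HYDROGENS]
    seq_ids = set()
    for a in mol:
        if a.element in HYDROGENS or a.seq_id in seq_ids:
            continue
        for b in heavy_partner:
            if math.dist((a.x, a.y, a.z), (b.x, b.y, b.z)) <= cutoff:
                seq_ids.add(a.seq_id)
                break
    return sorted(seq_ids)


# Made-up coordinates: only residue 10 has a heavy atom near the ligand.
protein = [
    Atom(10, "N", 0.0, 0.0, 0.0),
    Atom(11, "C", 10.0, 0.0, 0.0),
    Atom(12, "H", 3.0, 0.0, 1.0),   # close, but a hydrogen
]
ligand = [Atom(1, "C", 3.0, 0.0, 0.0)]
example_contacts = contacting_residues(protein, ligand)
```

Two molecules would be registered as a contacting pair whenever the returned list is non-empty.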
Outline of the HOMCOS server
Searching contact molecules for a query protein
Some protein families, such as globin and protein kinase, have hundreds of 3D structures in the PDB with various ligands and different mutations. Too many homologues yield too many bars on the page, which makes the page very complicated for users. To solve this problem, only the representative bars are shown in the default setting. For the monomer 3D bars, homologous monomeric 3D structures are listed in the order of increasing blastp E-value, and each is shown only if its number of additionally aligned residues is >10. For the 3D complex bars, a single linkage clustering is performed for all of the homologous dimers in the same interaction class. Only one representative dimer is shown for each cluster. The criteria of linking are as follows: (1) the “type” of the contact molecule is identical; (2) the Tanimoto coefficient of the binding sites is more than 0.2. The “type” of the contact molecule is set to cluster95 for proteins (see the previous section about data architecture), the three-letter code (comp_id) for compounds, metals, and precipitants, the sequence for nucleotides, and the pdbx_description for others (see attribute names in Fig. 1). The default setting of the service is the summarized “Summary Bars” page. If a user clicks the “Full Bars” icon, all of the predictions are shown on the page.
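The two linking criteria can be made concrete with a small sketch. The text does not spell out how the Tanimoto coefficient of binding sites is computed, so this sketch assumes that each binding site is represented as a set of query residue positions and uses the usual set form |A ∩ B| / |A ∪ B|. The function names and the example sites are hypothetical.

```python
from typing import Set


def tanimoto(site_a: Set[int], site_b: Set[int]) -> float:
    """Tanimoto coefficient of two binding sites given as residue-position sets."""
    union = site_a | site_b
    if not union:
        return 0.0  # two empty sites share nothing
    return len(site_a & site_b) / len(union)


def can_link(
    type_a: str,
    site_a: Set[int],
    type_b: str,
    site_b: Set[int],
    threshold: float = 0.2,
) -> bool:
    """Link two complexes if their contact-molecule types match and sites overlap."""
    return type_a == type_b and tanimoto(site_a, site_b) > threshold


# Made-up binding sites on the query sequence.
site_1 = {1, 2, 3, 4}
site_2 = {3, 4, 5}
site_3 = {20, 21, 22}

example_overlap = tanimoto(site_1, site_2)              # 2 shared / 5 total
example_same = can_link("ATP", site_1, "ATP", site_2)
example_other_type = can_link("ATP", site_1, "ADP", site_2)
example_far_site = can_link("ATP", site_1, "ATP", site_3)
```

Under these assumptions, the same compound bound at two distant sites yields two separate clusters, so both sites keep a representative bar.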
This service assumes that proteins with similar sequences should have similar binding properties. In other words, a query protein may bind to the contact molecules of proteins similar to the query. Fukuhara et al. reported that sequence similarity is the most effective feature to predict interacting protein pairs; however, its accuracy is limited for remote homologues. Users have to be careful when interpreting our server’s predictions, especially those based on weak similarities. At the top of the page in Figs. 5a and 6a, there are links to change the threshold value of sequence identity (seq_id(%)) from 0 to 100 %. The default is 0 %, which means that E-value <0.0001 is the only criterion for choosing the homologues. If users would like to focus on more reliable predictions, we recommend increasing the sequence identity threshold.
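The effect of the two thresholds can be shown with a minimal filter. The Hit class, its field names and the example values are hypothetical; only the E-value cutoff of 0.0001 and the default identity threshold of 0 % come from the text. Whether the server compares the identity with ≥ or > is not stated, so ≥ is an assumption here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    template: str     # hypothetical label, pdb_id + "_" + asym_id
    evalue: float     # blastp E-value
    identity: float   # sequence identity in percent


def select_homologues(
    hits: List[Hit], min_identity: float = 0.0, max_evalue: float = 1e-4
) -> List[Hit]:
    """Keep hits with E-value < max_evalue and identity >= min_identity."""
    return [
        h for h in hits
        if h.evalue < max_evalue and h.identity >= min_identity
    ]


example_hits = [
    Hit("1abc_A", 1e-80, 98.0),
    Hit("2def_B", 1e-12, 35.0),
    Hit("3ghi_A", 0.5, 28.0),   # fails the E-value criterion
]
default_selection = select_homologues(example_hits)
strict_selection = select_homologues(example_hits, min_identity=90.0)
```

With the default of 0 %, only the E-value decides; raising the identity threshold trades coverage for more reliable predictions.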
Searching contact molecules for a query compound
Modeling complex 3D structure of a hetero protein multimer
3D model viewer
Modeling complex 3D structure of a protein-compound complex
Relationship to VaProS server
Some of the services of the HOMCOS server can be used through the VaProS server (http://p4d-infor.nig.ac.jp/vapros/). VaProS is an integrated database system for structural life science; it includes not only the 3D structural database, but also the databases of sequences, molecular interactions, expressions, and diseases. It stands for “Variation effect on PROTein Structure and function”, because one of the important goals of the VaProS system is analyzing the phenotypic effects of mutations using protein 3D structures. The page of the service “contact molecules for query protein” is shown in the VaProS page, with the name “3D Interaction”. For fast searching from VaProS, the HOMCOS server stores pre-calculated BLAST results for all of the human sequences stored in UniProt. The VaProS server has the “Molecular Interaction” view, which shows molecular interactions as nodes and edges. At each edge for a protein–protein interaction, a link is placed to the service “modeling complex 3D structure of hetero protein multimer”.
Comparison with other servers
As we explained in the introduction, there are many servers for protein–protein interactions (PPIs) [7, 8, 10, 11, 12, 22, 23, 28]. The clear advantage of HOMCOS over these PPI servers is that it can deal with all of the molecules stored in the PDB. As shown in Figs. 5 and 6, the summaries considering all of the contacting molecules provide comprehensive insight into the target protein. The 3D structures of two proteins and a chemical compound often provide more functional information, as shown in Fig. 12. This extension is enabled largely by the mmCIF format files of the 3D structures. In particular, the identifier “asym_id” and the clear descriptions of the biological units are indispensable for our server, but are not included in the classical PDB format. Our program KCOMBU also contributes to the searching and superimposition of the chemical compound structures.
Another advantage of the HOMCOS server is its fast computation. Some web servers take more than half an hour for calculations [11, 23], or provide only pre-calculated 3D models [8, 38]. In contrast, the HOMCOS server only takes 1–3 min for searching and modeling for any user-input query molecules. This is simply because the HOMCOS server employs fast searching programs (blastp and dkcombu), provides only sequence-replaced models by default, and makes time-consuming atomic modeling optional. We think that our strategy provides a good balance between fast response and precise prediction for template-based modeling. We expect that users would always like to know quickly whether templates are available for their target molecules, because template-based modeling is useless when templates are not found. Detailed predictions will only be required when the user performs more detailed analyses, such as molecular simulations.
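The speed of the sequence-replaced model comes from the fact that it only relabels template residues with the names and numbers of the query protein. The following sketch shows that relabeling under several assumptions: the Atom tuple, the function name and the alignment input (two gapped strings plus start positions) are hypothetical, template residues without an aligned query residue are dropped, and side-chain atoms are left untouched because their treatment is not described here.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple

THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}


class Atom(NamedTuple):
    seq_id: int
    res_name: str
    atom_name: str
    x: float
    y: float
    z: float


def sequence_replaced_model(
    template_atoms: Sequence[Atom],
    query_aln: str,
    template_aln: str,
    query_start: int = 1,
    template_start: int = 1,
) -> List[Atom]:
    """Relabel aligned template residues with query residue names and numbers."""
    mapping: Dict[int, Tuple[int, str]] = {}
    q_pos, t_pos = query_start - 1, template_start - 1
    for q, t in zip(query_aln, template_aln):
        if q != "-":
            q_pos += 1
        if t != "-":
            t_pos += 1
        if q != "-" and t != "-":
            mapping[t_pos] = (q_pos, THREE_LETTER.get(q, "UNK"))

    model = []
    for atom in template_atoms:
        if atom.seq_id in mapping:  # unaligned template residues are dropped
            new_id, new_name = mapping[atom.seq_id]
            model.append(atom._replace(seq_id=new_id, res_name=new_name))
    return model


# Made-up template fragment (residues 5-7) and a three-column alignment.
template = [
    Atom(5, "GLY", "CA", 0.0, 0.0, 0.0),
    Atom(6, "SER", "CA", 3.8, 0.0, 0.0),
    Atom(7, "LEU", "CA", 7.6, 0.0, 0.0),
]
example_model = sequence_replaced_model(template, "A-W", "GSL", 1, 5)
```

Because no coordinates are changed, the cost is linear in the number of template atoms, which is consistent with the short response times reported above.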
Further improvement of the server
The current HOMCOS server has many more functions than our previous server, established in 2008. However, the server still has room for improvement. First of all, we only employ the program blastp for protein sequence searches, but more sensitive profile-based programs or structure comparison programs are sometimes required to detect remote homologues. They can increase the coverage of the model at the expense of precision, and can provide longer and more accurate alignments with templates. The technical problem in implementing them in our server is their longer computation time. We first plan to include PSI-BLAST in the near future, considering the balance between its computation cost and advantages.
Second, quality evaluations of the model should be introduced. Currently, the HOMCOS server only shows the sequence similarity and chemical similarity between the target query and template molecules. These similarities are reported to be good measures for the quality of models [30, 54], and detailed analyses of the evolution of complex 3D structures will be reported elsewhere. Our analyses show that the sequence identity of the contact sites has a better correlation with the structural difference of the complex, especially for complexes with small chemical compounds. The HOMCOS server shows this new measure, the sequence identity of the contact sites, in the “Summary Bar” view (Fig. 5) and the 3D model view (Fig. 11). Other measures, such as interfacial contact potential energies, may need to be introduced.
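The sequence identity of the contact sites is not defined formally in this section. One plausible reading, sketched below, restricts the usual identity calculation to the alignment columns whose template residue is a contact site. The function name, the alignment input and the choice to count a query gap at a contact site as a mismatch are all assumptions of this sketch.

```python
from typing import Iterable


def contact_site_identity(
    query_aln: str,
    template_aln: str,
    contact_seq_ids: Iterable[int],
    template_start: int = 1,
) -> float:
    """Percent identity over alignment columns whose template residue is a contact site."""
    contacts = set(contact_seq_ids)
    t_pos = template_start - 1
    n_site = n_same = 0
    for q, t in zip(query_aln, template_aln):
        if t == "-":
            continue  # insertion in the query, no template residue here
        t_pos += 1
        if t_pos in contacts:
            n_site += 1
            if q == t:  # a gap in the query counts as a mismatch
                n_same += 1
    return 100.0 * n_same / n_site if n_site else 0.0


# Made-up alignment; template residues 2-5 are taken as the contact site.
example_site_identity = contact_site_identity("MKT-LV", "MRTALV", [2, 3, 4, 5])
example_full_identity = contact_site_identity("MKT-LV", "MRTALV", range(1, 7))
```

In this example the contact-site identity is lower than the identity over all aligned template residues, which is the kind of difference the measure is meant to expose.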
Third, some molecules such as nucleotides and polysaccharides cannot be searched and modeled by the current HOMCOS server, although our database contains all of the molecules in the PDB. As shown in Fig. 5, the service “searching contact molecules for query protein” can find the complex structures with nucleotides and polysaccharides. If their sequences and chemical structures do not have to be changed, they can be used as the rigid body template molecule with the homologous template protein. We hope this function is useful, but it may not be sufficient. There are no standard tools for searching polysaccharides, or for modifying the chemical structures of nucleotides and polysaccharides, although a nucleotide sequence can be searched by the blastn program. We may have to develop these tools, if many users request the modeling of these molecules.
Fourth, we accept at most two query protein sequences and one query chemical compound; however, more query molecules may have to be accepted. In principle, the modeling of multiple query sequences is possible. The problems are rather technical: its high computation cost and the need for a user interface to input many query molecules.
Finally, our predictions of binding molecules have limited accuracies. They are performed by our two services, “searching contacting molecules for a query protein” and “searching contacting proteins for a query compound”. These predictions are based on the similar property principle, which states that “similar molecules have similar common binding properties”. However, this rule is simply empirical, and exceptional cases cannot be ignored. Fukuhara et al. reported that sequence similarity is essential to predict interactions; however, its accuracy is limited. Users have to be careful when using these services, especially when similarities are weak. More studies must be performed for more precise prediction of molecular interactions.
Our new server has been greatly improved in comparison to its previous version, established in 2008. It contains all of the different types of molecules in the PDB, and provides a useful summary of the contact molecules for a query protein. We hope our server will be useful for a wide range of life-science researchers, and will contribute toward making full use of the 3D structural data stored in the PDB.
We thank Prof. Akira Kinjo and Kei Yura for supervising the project. We also thank Prof. Haruki Nakamura for critical reading of the manuscript and allowing us to use the address “.pdbj.org” for our server. We are grateful to the participants of the HOMCOS and VaProS workshops for their helpful comments and advice. This work is supported by the Platform Project for Supporting in Drug Discovery and Life Science Research (Platform for Drug Discovery, Informatics, and Structural Life Science) from Japan Agency for Medical Research and Development (AMED).
- 1. Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U, Jandrasits C, Jimenez RC, Khadake J, Mahadevan U, Masson P, Pedruzzi I, Pfeiffenberger E, Porras P, Raghunath A, Roechert B, Orchard S, Hermjakob H (2012) The IntAct molecular interaction database in 2012. Nucleic Acids Res 40:D841–D846
- 42. Westbrook JD, Fitzgerald PMD (2003) The PDB format, mmCIF, and other data formats. In: Bourne PE, Weissig H (eds) Structural bioinformatics. Wiley, Hoboken
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.