Coverage of whole proteome by structural genomics observed through protein homology modeling database

Yura, Kei; Yamaguchi, Akihiro; Go, Mitiko

doi:10.1007/s10969-006-9010-3

Coverage of whole proteome by structural genomics observed through protein homology modeling database

Original Paper
Open access
Published: 05 December 2006

Volume 7, pages 65–76, (2006)
Cite this article

Download PDF

You have full access to this open access article

Journal of Structural and Functional Genomics

Coverage of whole proteome by structural genomics observed through protein homology modeling database

Download PDF

Kei Yura¹,
Akihiro Yamaguchi² &
Mitiko Go^2,3

1271 Accesses
10 Citations
Explore all metrics

Abstract

We have been developing FAMSBASE, a protein homology-modeling database of whole ORFs predicted from genome sequences. The latest update of FAMSBASE (http://daisy.nagahama-i-bio.ac.jp/Famsbase/), which is based on the protein three-dimensional (3D) structures released by November 2003, contains modeled 3D structures for 368,724 open reading frames (ORFs) derived from genomes of 276 species, namely 17 archaebacterial, 130 eubacterial, 18 eukaryotic and 111 phage genomes. Those 276 genomes are predicted to have 734,193 ORFs in total and the current FAMSBASE contains protein 3D structure of approximately 50% of the ORF products. However, cases that a modeled 3D structure covers the whole part of an ORF product are rare. When portion of an ORF with 3D structure is compared in three kingdoms of life, in archaebacteria and eubacteria, approximately 60% of the ORFs have modeled 3D structures covering almost the entire amino acid sequences, however, the percentage falls to about 30% in eukaryotes. When annual differences in the number of ORFs with modeled 3D structure are calculated, the fraction of modeled 3D structures of soluble protein for archaebacteria is increased by 5%, and that for eubacteria by 7% in the last 3 years. Assuming that this rate would be maintained and that determination of 3D structures for predicted disordered regions is unattainable, whole soluble protein model structures of prokaryotes without the putative disordered regions will be in hand within 15 years. For eukaryotic proteins, they will be in hand within 25 years. The 3D structures we will have at those times are not the 3D structure of the entire proteins encoded in single ORFs, but the 3D structures of separate structural domains. Measuring or predicting spatial arrangements of structural domains in an ORF will then be a coming issue of structural genomics.

A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome

A large-scale evaluation of algorithms to calculate average nucleotide identity

Article 15 February 2017

Protein Data Bank (PDB): The Single Global Macromolecular Structure Archive

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Genome sequencing projects provided a huge number of amino acid sequences without functional information (Stein 2001). To discover biological functions of those proteins, both computational predictions and biochemical experiments are necessary (Tsoka and Ouzounis 2000). Most of the proteins perform functions after forming specific 3D structures, and therefore protein 3D structure is one of the most valuable sources of information to predict protein function (Domingues et al. 2000; Xie and Bourne 2005). Protein function prediction based on 3D structures, especially protein surface structures, with evolutionary and/or physicochemical characteristics have been extensively studied (Lichtarge and Sowa 2002; Campbell et al. 2003; Kinoshita and Nakamura 2003; Laskowski et al. 2003; Ota et al. 2003; Pieper et al. 2006). However, determining protein structures of all the function-unknown proteins for applying these types of study is not practical.

Proteins are classified into a large number of ‘families’ based on the amino acid sequence similarity (Dayhoff 1972), and proteins with similar amino acid sequences are known to have similar 3D structures (Chothia and Lesk 1986), all because the proteins in a family are evolutionary related (Doolittle 1995). Once we have 3D structure of at least one of the proteins in a family, then 3D structures of other proteins in the same family can be computationally deduced by ‘homology modeling’ (Burley 2000; Baker and Sali 2001). Based on this logic, structural genomics (SG) projects, which are to determine protein 3D structures of representatives for each family have been proposed and launched (Vitkup et al. 2001; Brenner 2000; Burley and Bonnano 2002). In homology modeling, corresponding residues between an amino acid sequence of structure unknown protein (target) and that of 3D structure known protein (template) in the same family are determined by sequence alignment and every residue in a template protein is replaced by that in a target protein (Marti-Renoma et al. 2000).

SG projects have been providing new protein structures (Todd et al. 2005; Xie and Bourne 2005; Chandonia and Brenner 2006). Protein Data Bank (PDB) (Berman et al. 2000) now contains more than 390 3D structures for function unknown or hypothetical proteins (Stark et al. 2004). Protein function predictions based on 3D structures determined by SG projects are also in progress (Goldsmith-Fischman and Honig 2003; Liu et al. 2005; Petrey and Honig 2005). There are some projects that focus on a specific species and try to determine the 3D structures of whole proteins encoded in the genome of the species (Kim 2000; Yokoyama et al. 2000; Kim et al. 2003). Those projects provide a considerable number of 3D structures in a single protein family. This results in providing multiple templates for a single protein family and it can improve quality of homology modeling (Contreras-Moreira et al. 2003).

We have developed FAMSBASE; a database for homology modeling 3D structures of whole proteins predicted on whole genome sequences, since 2001 (Yamaguchi et al. 2003; http://daisy.nagahama-i-bio.ac.jp/Famsbase/). FAMSBASE contains results of homology modeling by FAMS, a full automatic modeling software (Ogata and Umeyama 2000). Sequence alignments between whole ORFs and proteins in PDB are based on GTOP (Kawabata et al. 2002).

We report here the update of the database including differences in the amount of structural data from the previous version, estimation of the time that whole ORFs predicted out of genome sequences are covered by homology modeling 3D structures and upcoming issues for utilizing those modeled structures.

Methods

Data update of FAMSBASE

Correspondence between ORFs derived from whole genome sequences and protein amino acid sequences whose 3D structures are known is provided by GTOP database (Kawabata et al. 2002). The update in May 2005 of FAMSBASE is based on February 2004 version of GTOP. Protein 3D structures in PDB by November 2003 are used for homology modeling templates. FAMS (Ogata and Umeyama 2000) is applied by Umeyama et al. to pair-wise alignments between a predicted ORF sequence and an amino acid sequence with known 3D structure, and a 3D structure is modeled. All the results are stored in FAMSBASE.

Assessing annual difference of data in FASBASE

Based on the amount of data in FAMSBASE in 2001 and the amount of increase in the following years, a due year for whole proteome 3D structure models is estimated. Estimation is done residue-wise, not ORF-wise, since modeled structures in FAMSBASE are often limited to structural domains. In this report, structural domains refer to SCOP domains (Andreeva et al. 2004). All ORFs predicted out of genome sequences are divided into soluble and membrane proteins. The division is carried out by SOSUI (Hirokawa et al. 1998), and a protein with one or more transmembrane regions is classified into a membrane protein. The number of residues of whole soluble proteins encoded in the genome sequence (G) of species i is denoted as S$_{{\rm G}i}$, and the number of residues of whole membrane proteins is denoted as M$_{{\rm G}i}$. The number of residues included in modeled 3D structures of soluble and membrane proteins are denoted as S3$_{{{\rm G}i}}$ and M3$_{{{\rm G}i}}$, respectively. For a certain genome Gi, the coverage of modeled 3D structures in whole soluble proteins is then S3$_{{\rm G}i}$/S $_{{\rm G}i}\times100$ and the coverage for whole membrane proteins is M3$_{{\rm G}i}$/M$_{{\rm G}i}\times100$. The coverage is summarized in different kingdoms of life as in the following equations;

$$ \begin{aligned} \hbox{coverage of soluble protein }&=\sum\limits_{i\in {\rm kingdom}} {{\rm S3}_{{\rm G}i}} /\sum\limits_{i\in {\rm kingdom}} {{\rm S}_{{\rm G}i}}\times 100,\\ \hbox{coverage of membrane protein }&=\sum\limits_ {i\in {\rm kingdom}} {{\rm M3}_{{\rm G}i}} /\sum\limits_{i\in {\rm kingdom}} {{\rm M}_{{\rm G}i}}\times100. \end{aligned} $$

Both figures are calculated based on the data at the different times of FAMSBASE update, gradients in figures are then calculated, and the figures are extrapolated up to the year that coverage reaches to 100.

It is getting to be known that not all ORFs assume stable 3D structures. Some parts of ORFs are considered to be natively disordered (Oldfield et al. 2005; Dyson and Wright 2005). Hence it is unlikely that coverage by homology modeling reaches to 100. We, therefore, estimate disordered regions in whole ORFs by DisEMBL (Linding et al. 2003) and omit these disordered regions from the calculation.

Non-overlap multiple model structures in single ORFs

Modeled 3D structures in FAMSBASE are often limited to structural domains. To find an ORF of which most of the entire 3D structure is modeled in pieces of structural domains, an ORF covered by non-overlapping three or more modeled 3D structures in eukaryotic genome is surveyed based on the following criteria; (1) 70% or more residues in the ORF are included in one of the modeled 3D structures, (2) the ORF contains three or more non-overlapping modeled structures, and (3) the sequence identity between a template protein and a target domain is no less than 25%. At the time of FAMSBASE building, five model structures are at most built for each ORF (Yamaguchi et al. 2003). Therefore, the expected number of modeled structures in the above criteria is between three and five.

Prediction of domain interfaces

The 3D structure in pieces for a single ORF needs to be assembled to model the entire 3D structure. For this procedure, a prediction of domain interfaces of each 3D structure is needed. A hydrophobicity index based on protein 3D structures is built for domain interface prediction. Hydrophobicity of amino acid residue is measured by buriedness of a residue inside the protein 3D structures. A representative 4,529 chains in PDB among which sequence identities are less than 30% were selected and solvent accessibility of each residue is calculated on a monomer state. For each amino acid residue type i (i = 1,...,20), the number of residue with accessibility no less than b (=0.0 − 1.0) is counted (S _b,i). Database derived hydrophobicity index (I _b,i) is obtained by;

$$ I_{b,i} =-\log_2 \left({\left({S_{b,i}/\sum\limits_i{S_{b,i}} }\right)/\left({S_{0,i}/\sum\limits_i{S_{0,i}}}\right)}\right). $$

b is set to 0.15 to maximize the difference of I _b,i among different residues. The index has good correlation with Kyte and Doolittle hydrophobicity index (Kyte and Doolittle 1982). The index I _0.15,i is assigned to every residue on the surface (accessibility no less than 0.15) of a modeled 3D structure. The hydrophobicity of each residue on a surface of a protein is then obtained by averaging the assigned values of residues within 7.0 Å from the residue in concern. A hydrophobic patch on the surface of the modeled structure is found as a cluster of surface residues with the hydrophobicity no less than 0.0.

Results and discussion

Coverage of whole protein space by homology modeling

The latest update of FAMSBASE at May 2005 uses protein 3D structures deposited to PDB by the end of Nov. 2003 and ORFs predicted from genome sequences deposited by February 2004 (http://daisy.nagahama-i-bio.ac.jp/ Famsbase/). The latest FAMSBASE contains 1,396,272 modeled 3D structures of 368,724 ORFs derived from 17 archaebacterial, 130 eubacterial, 18 eukaryotic and 111 phage genomes; in total 276 genomes. Five models at maximum are built for each ORF in FAMSBASE. Those five models are the structure for the same or different regions in the ORF. When multiple models are built for the same region of ORF, we can evaluate the reliability of the model. When the model based on different templates have the similar 3D structures, then the 3D structure would be reliable. When the structures are different, the modeled structure would be less reliable. We further test the quality of modeled 3D structure by ProsaII (Sippl 1993) and find that about 72% of the modeled 3D structures are energetically ranked as number one and comparable to experimentally determined 3D structures. Some of the structures that fail the test are structures of a part of a large protein, mostly structural domains of large proteins. It is difficult to assess the quality of this type of domain structures, because interfaces of the domain for other parts of the protein are exposed in the modeled structures. Tendency of amino acid residue appearance in the interface is supposed to be different from that at the surface as we discuss down below.

In the genome of 276 species, 734,193 ORFs are predicted. Therefore, in FAMSBASE, 3D structure of 50% (368,724/734,193) of ORFs have been built and stored (Table 1). These are about 47% of ORFs in archaebacterial genomes, about 52% in eubacterial genomes and about 49% of eukaryotic genomes.

Table 1 Number of ORFs and those with modeled 3D structures in 276 genomes

Full size table

When a modeled 3D structure is counted based on the number of amino acid residues, not on the number of ORFs, a different aspect emerges. Figure 1 shows the percentage of amino acid residues per ORF included in the modeled structures. ORFs without a modeled structure are omitted. Of archaebacterial and eubacterial genomes, in 60% of ORFs, more than 80% of the residues are included in modeled 3D structures, however, of eukaryotic genomes, only in 30% of ORFs, more than 80% of the residues are included (red and blue sections in Fig. 1). The proportion of residues in modeled 3D structure can be measured by the number of residues in a typical structural domain as shown in SCOP (Andreeva et al. 2004). The average size of protein domain is around 100–150 residues (Copley et al. 2002). In ORFs with modeled structures, a continuous region of residues with one domain or more remains as structure unknown in only about 18% of ORFs of archaebacterial and eubacterial genomes, whereas in about 60% of ORFs of eukaryotic genomes, the regions with one domain or more remain as structure unknown.

Annual difference of model structures

In FAMSBASE of 2001, 38% of amino acid residues in all ORFs of archaebacterial and 40% of eubacterial genomes were included in modeled 3D structures (Yamaguchi et al. 2003). In the current update of FAMSBASE based on data by around 2004, 42% of amino acid residues in all ORFs in archaebacterial and 46% of eubacterial genomes are included in modeled structures. In eukaryotic genomes, 24% of amino acid residues in 2003, and 26% in 2004 are included in modeled 3D structures. Those figures can be used to estimate the time when modeled 3D structures of whole proteins predicted from genomes are obtained. The estimation for the time obtaining the whole soluble and membrane proteins are treated separately, because the speed of structure determination for soluble and membrane proteins seems to differ. The assumption for the estimation is that the speed for structure determination would stay the same and no new protein family would appear.

For eubacterial genomes, 72.6% of residues in whole ORFs are predicted by SOSUI (Hirokawa et al. 1998) to encode soluble proteins and 27.4% to encode membrane proteins. This ratio is not so different from the previous prediction by Krogh et al. (2001). Of about 40% of whole eubacterial ORF that were with modeled 3D structures in 2001, approximately 90% were soluble proteins and 10% were membrane proteins. Therefore, about 50% (=0.40 × 0.90/0.726) of the whole soluble proteins were modeled. Of the whole membrane proteins in eubacterial genome, about 15% (=0.40 × 0.10/0.274) were modeled. In 2004, those figures are grown to 57% and 19%, respectively. In eubacterial whole ORFs, about 19.9% of amino acid residues are predicted to be included in disordered region by DisEMBL (Linding et al. 2003). Some of these regions are included in the modeled structures. These regions are either incorrectly predicted regions or incorrectly modeled regions. Assuming that the disordered regions without modeled 3D structures are correctly predicted, 10.8% of amino acid residues in soluble proteins were disordered and we would never obtain 3D structures of those regions. Then, by extrapolating the coverage of soluble proteins up to 89.2% (100–10.8) with the current growth rate, we can estimate that, by the year 2017, whole soluble proteins encoded in eubacterial genomes can be modeled (Fig. 2). Whole soluble proteins of archaebacterial genome can be modeled by 2021 and those of eukaryotic genomes, by 2031.

Orengo et al. (1999) showed percentage of ORFs with protein 3D structures as between 30 and 46% in 1999. The genome sequences known by 1999 were mostly derived from prokaryotic species and the known protein 3D structures were mostly soluble proteins. Therefore, the figures they presented in 1999 should correspond to the figures of archaebacterial and eubacterial soluble proteins. When we extrapolate the figures of archaebacterial and eubacterial soluble proteins to the past in Fig. 2, the figures are around 40% in 1999, indicating that their figures approximately lie on the extrapolated lines.

The current estimation indicates that we will obtain 3D structures of whole soluble proteins of eubacteria in 11 years and archaebacteria in 15 years. This estimation does not take into account the acceleration of structure determination speed by automation (McPherson 2004; DeLucas et al. 2005), which makes the due days closer to the present. For membrane proteins, speed of structure determination has been drastically accelerated by recent technical innovations (Kyogoku et al. 2003; Lundstrom 2004; Walian et al. 2004; Dobrovetsky et al. 2005), and therefore we will not linearly extrapolate the present status to estimate the due day for membrane proteins.

Frequency of template structure in use

When the template 3D structures used in FAMSBASE are classified by SCOP superfamily, which is a group of proteins that have low sequence identities but whose structural and functional features suggest that a common evolutionary origin is probable (Lo Conte et al. 2002), and frequencies of superfamilies in use are counted, ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily is found to be the most frequent one; 7,532 times (about 12%) in whole archaebacterial model structures, 77,806 (about 10%) in eubacterial structures and 35,468 times (about 6%) in eukaryotic structures. The templates that follow in frequency in archaebacterial and eubacterial protein structures are ‘NAD(P)-binding Rossmann fold domains’, ‘4Fe–4S ferredxin’, and ‘PLP-dependent transferases’ superfamilies. In eukaryotic protein structures, ‘protein-kinase’, ‘immunoglobulin’ and ‘C₂H₂ and C₂HC zinc fingers’ superfamilies, which appear specifically in eukaryotic genomes, follow the top.

Differences in distribution of frequency of templates in different kingdoms of life are evident, when frequencies in use of template are plotted in descending order (Fig. 3). In any kingdoms of life, the frequencies of the most and the second most used templates exceed those of the remaining templates. The frequencies of templates in use drops first in archaebacterial protein structures and then in eubacterial protein structures. The descending curve of eukaryotic template frequency is less steep compared with the others, indicating that one template can produce a large number of domain 3D structures in eukaryotic ORFs. In other words, a significant number of proteins encoded in eukaryotic genomes are originated by domain duplication, as Koonin et al. (2000) demonstrated. Superfamilies with the 3D structures and with many copies in eukaryotic genomes, but seldom in prokaryotic genomes are ‘protein kinase-like’, ‘immunoglobin’, ‘RNA-binding domain’, ‘C₂H₂ and C₂HC zinc fingers’, ‘WD40-repeat’, ‘glucocorticoid receptor-like’, ‘homeodomain-like’, ‘PH domain-like’, ‘RING-box’, ‘L domain’, ‘ankyrin repeat’, ‘ARM repeat’, ‘cytochrome P-450’ and ‘EF-hand’ superfamilies. These superfamilies are transcription factors, protein–protein interaction mediators and response factor for toxic substances, mostly known to be unique to eukaryotes.

The ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily outnumbering other superfamily in template frequency corresponds to the previous finding that the enzyme is highly frequently used in every kingdom of life (Leipe et al. 2003). When biological functions of these ORFs with the 3D structure of ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily are classified, about half of the proteins are ABC transporters in archaebacterial and eubacterial proteomes, but numbers of G-proteins and motor proteins in eukaryotic proteomes are noticeable (Fig. 4).

In the last two years, new protein structures were determined and contributed to an increase in the number of templates for homology modeling. A part of those template structures are listed in Table 2. A part of those top 15 templates contributed a lot for the growth of modeled 3D structure database. In Table 2, 3D structure derived from SG projects is rare. The ratio of SG products in Table 2 is the same as that in PDB (Editorial Board, Nature Structural & Molecular Biology 2004). As the SG projects in US and Europe have proceeded to phase 2 (Service 2005), SG products are expected to contribute to increase in the number of templates in the near future. The qualities of protein 3D structures, namely, size, resolution, R-factors and so forth, derived from SG projects were compared with those in PDB and no obvious compromise in quality of SG products were found (Todd et al. 2005). The quality of homology modeling based on products of SG projects in the future, therefore, will be expected to be no less than the current quality.

Table 2 Top 15 modeling templates in the newly determined 3D structures between 2002 and 2003

Full size table

Whole structure and function of proteins from homology modeling of domain structures

Protein function prediction, especially studies on enzyme specificity, based on homology modeling structures is intensively carried out in the field of drug design and related fields (Goldsmith-Fischman and Honig 2003; Kopp and Schwede 2004). Those studies are mostly based on homology modeling of domain structures. As mentioned above, most of the eukaryotic protein structures in FAMSBASE are 3D structures of structural domains, not the entire coding regions (Fig. 1). Protein functional sites are often located at a cleft of domains (Laskowski et al. 1996), and therefore understanding relative location of domains will be a critical issue. Xie and Bourne (2005) and O’Toole et al. (2003) also pointed out this problem and mentioned, “even if all the domains of a multiple-domain query sequence have determined structures, the individual structures will not enable accurate modeling of how they associate together in the structure of the entire proteins (O’Toole et al. 2003).”

Figure 5 shows all eukaryotic ORFs whose 3D structures are mostly modeled in pieces. There are three types of enzymes and four types of cell surface receptors. A protein structure of ENSP00000264705 which is an ORF found in human genome can be modeled based on Escherichia coli carbamoylphosphate synthetase (CPS) and Pyrococcus abyssi asparatate transcarbamoylase (ATC). E. coli CPS is composed of a large subunit and a small subunit. CPS and ATC are the first and the second enzymes, respectively, in pyrimidine biosynthesis pathway. In mammalian genomes, those proteins are coded by a single gene and active in a hexamer form (Serre et al. 2004). Interactions between the large subunit domain and the small subunit domain of human CPS are conjectured to be the same as those between the large and the small subunits of E. coli CPS. N-terminal residues of the large subunit and the C-terminal residues of the small subunit are spatially located close in E. coli CPS, which permits the two chains to be chemically connected without disrupting subunit interfaces. To be active, human CPS should form a hexamer supramolecule and the interfaces for the supramolecule formation should be predicted from the modeled 3D structures. At the moment, the interfaces are unknown.

ENSMUSP00000019416 is an ORF found in mouse genome and encodes a putative cell surface receptor. The protein is predicted to consist of six consecutive Ig-fold domains. There is a putative transmembrane helix at the C-terminal region of the protein. Two consecutive Ig-fold domains are modeled without overlap, and no pieces of information for relative orientation of three modeled structures have been found. Information of interaction sites of those domains is required to build the entire structure of the protein and to predict a target molecule of this receptor. Computational analyses of domain interfaces and of protein–protein interfaces have been targets for extensive study for a long time, and some general characteristics have been found. One of them is the hydrophobicity of the interfaces (Wodak and Janin 2002). Hydrophobic clusters on the surface of modeled structures of ENSMUSP00000019416 are shown in right side of Fig. 5. One of the template structures, Nkp46 ectodomain, has hydrophilic surface (green) around the C-terminal residues of the domain, however the modeled structure has a hydrophobic surface (orange) at the corresponding area. The other template structure, LIR-1 D1D2, has a hydrophilic surface around the N-terminal residues of the domain, however the modeled structure has a hydrophobic surface at the corresponding area. The surfaces uniquely turned into hydrophobic in modeled structures are close to the residues that are chemically bonded in the target protein, and therefore both of the areas likely form interfaces of the two domains. The modeled structure based on LIR-1 D1D2 domain has another hydrophobic surface around the C-terminal residues, which may interact with CD158j-like domain located at the C-terminal side of the domain.

Accuracy of homology modeling

There are at least three major issues that affect accuracy in homology modeling; the best template selection, accuracy of an amino acid sequence alignment between template and target protein sequences and the accuracy of structure building procedure itself (Contreras-Moreira et al. 2005). Accuracy of the alignment is high, when sequence identity of template and target proteins is higher than 30%, and alignment of proteins with identity less than 30% is known to be less reliable, thereby accuracy of homology modeling deteriorates (Kopp and Schwede 2004). FAMS has been shown to construct relatively accurate model structures, even with low sequence identity between template and target sequences in CAFASP2, the homology modeling competition (Iwadate et al. 2001; Yamaguchi et al. 2003). A distribution of sequence identity between amino acid sequences of template and target proteins in FAMSBASE is shown in Fig. 6. Half of the model structures in FAMSBASE rely on alignments of sequence identity less than 20%. Figure 6 suggests that the current 3D structure database does not contain good enough structures for high quality homology modeling. SG projects will eventually provide better template structures, and improvement in target selection, alignment and modeling methods are also in pursuit to overcome the difficulties in homology modeling (John and Sali 2003; Wallace et al. 2005).

Conclusion

Construction of database of whole genome homology modeling clarified that protein 3D structures of about 50% of the protein coding regions in whole genome can now be modeled. Maintaining the current speed of 3D structure determination, it will take, at most, 11 years to have enough templates to cover whole soluble proteins of eubacterial genomes, and 25 years to cover those of eukaryotic genomes. The current advancement in technologies of protein structure determination is expected to make these due times closer to the present. What we obtain at those times are not the 3D structures of entire proteins, but domain structures in pieces. A homology modeled domain structure is now in use of predicting domain functions, but predicting spatial arrangement of domains in a protein will be an important issue for function prediction.

References

Andreeva A, Howorth D, Brenner SE, Hubbard TJP, Chothia C, Murzin AG (2004) Nucleic Acids Res 32:D22
Article CAS Google Scholar
Baker D, Sali A (2001) Science 294:93
Article PubMed CAS Google Scholar
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) Nucleic Acids Res 28:235
Article PubMed CAS Google Scholar
Brenner SE (2000) Nat Struct Biol 7(Suppl):967
Article PubMed CAS Google Scholar
Burley SK (2000) Nat Struct Biol 7(Suppl):932
Article PubMed CAS Google Scholar
Burley SK, Bonanno JB (2002) Annu Rev Genomics Hum Genet 3:243
Article PubMed CAS Google Scholar
Campbell SJ, Gold ND, Jackson RM, Westhead DR (2003) Curr Opin Struct Biol 13:389
Article PubMed CAS Google Scholar
Chandonia J-M, Brenner SE (2006) Science 311:347
Article PubMed CAS Google Scholar
Chothia C, Lesk AM (1986) EMBO J 5:823
PubMed CAS Google Scholar
Contreras-Moreira B, Ezkurdia I, Valencia TA (2005) FEBS lett 579:1203
Article PubMed CAS Google Scholar
Contreras-Moreira B, Fitzjohn PW, Bates PA (2003) J Mol Biol 328:593
Article PubMed CAS Google Scholar
Copley RR, Ponting CP, Schultz J, Bork P (2002) Adv Protein Chem 61:75
PubMed Google Scholar
Dayhoff MO (1972) Atlas of protein sequence and structure, vol 5. National Biomedical Research, Foundation Georgetown University, Washington, DC
Google Scholar
DeLucas LJ, Hamrick D, Cosenza L, Nagy L, McCombs D, Bray T, Chait A, Stoops B, Belgovskiy A, William Wilson W (2005) Prog Biophys Mol Biol 88:285
Article PubMed CAS Google Scholar
Dobrovetsky E, Lu ML, Andorn-Broza R, Khutoreskaya G, Bray JE, Savchenko A, Arrowsmith CH, Edwards AM, Koth CM (2005) J Struct Funct Genomics 6:33
Article PubMed CAS Google Scholar
Domingues FS, Koppensteiner WA, Sippl MJ (2000) FEBS Lett 476:98
Article PubMed CAS Google Scholar
Doolittle RF (1995) Annu Rev Biochem 64:287
Article PubMed CAS Google Scholar
Dyson HJ, Wright PE (2005) Nat Rev Mol Cell Biol 6:197
Article PubMed CAS Google Scholar
Editorial Board (2004) Nat Struct Mol Biol 11:201
Article CAS Google Scholar
Goldsmith-Fischman S, Honig B (2003) Protein Sci 12:1813
Article PubMed CAS Google Scholar
Hirokawa T, Boon-Chieng S, Mitaku S (1998) Bioinformatics 14:378
Article PubMed CAS Google Scholar
Iwadate M, Ebisawa K, Umeyama H (2001) Chem-Bio Infor J 1:136
Article Google Scholar
John B, Sali A (2003) Nucleic Acids Res 31:3982
Article PubMed CAS Google Scholar
Kawabata T, Fukuchi S, Homma K, Ota M, Araki J, Ito T, Ichiyoshi N, Nishikawa K (2002) Nucleic Acids Res 30:294
Article PubMed CAS Google Scholar
Kim S-H, Shin DH, Choi I-G, Schulze-Gahmen U, Chen S, Kim R (2003) J Struct Funct Genomics 4:129
Article PubMed CAS Google Scholar
Kim S-H (2000) Curr Opin Struct Biol 10:380
Article PubMed CAS Google Scholar
Kinoshita K, Nakamura H (2003) Curr Opin Struct Biol 13:396
Article PubMed CAS Google Scholar
Koonin EV, Aravind L, Kondrashov AS (2000) Cell 101:573
Article PubMed CAS Google Scholar
Kopp J, Schwede T (2004) Nucleic Acids Res 32:D230
Article PubMed CAS Google Scholar
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) J Mol Biol 305:567
Article PubMed CAS Google Scholar
Kyogoku Y, Fujiyoshi Y, Shimada I, Nakamura H, Tsukihara T, Akutsu H, Odahara T, Okada T, Nomura N (2003) Acc Chem Res 36:199
Article PubMed CAS Google Scholar
Kyte J, Doolittle RF (1982) J Mol Biol 157:105
Article PubMed CAS Google Scholar
Laskowski RA, Watson JD, Thornton JM (2003) J Struct Funct Genomics 4:167
Article PubMed CAS Google Scholar
Laskowski RA, Luscombe NM, Swindells MB, Thornton JM (1996) Protein Sci 5:2438
PubMed CAS Google Scholar
Leipe DD, Koonin EV, Aravind L (2003) J Mol Biol 333:781
Article PubMed CAS Google Scholar
Lichtarge O, Sowa ME (2002) Curr Opin Struct Biol 12:21
Article PubMed CAS Google Scholar
Linding R, Jensen LJ, Diella F, Bork P, Gibson TJ, Russell RB (2003) Structure 11:1453
Article PubMed CAS Google Scholar
Liu G, Li Z, Chiang Y, Acton T, Montelione GT, Murray D, Szyperski T (2005) Protein Sci 14:1597
Article PubMed CAS Google Scholar
Lo Conte L, Brenner SE, Hubbard TJ, Chothia C, Murzin AG (2002) Nucleic Acids Res 30:264
Article PubMed CAS Google Scholar
Lundstrom K (2004) Curr Opin Drug Discov Devel 7:342
PubMed CAS Google Scholar
Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A (2000) Annu Rev Biophys Biomol Struct 29:291
Article PubMed CAS Google Scholar
McPherson A (2004) J Struct Funct Genomics 5:3
Article PubMed CAS Google Scholar
Ogata K, Umeyama H (2000) J Mol Graph Model 18:258
Google Scholar
Oldfield CJ, Ulrich EL, Cheng Y, Dunker AK, Markley JL (2005) Proteins 59:444
Article PubMed CAS Google Scholar
Orengo CA, Pearl FM, Bray JE, Todd AE, Martin AC, Lo Conte L, Thornton JM (1999) Nucleic Acids Res 27:275
Article PubMed CAS Google Scholar
Ota M, Kinoshita K, Nishikawa K (2003) J Mol Biol 327:1053
Article PubMed CAS Google Scholar
O’Toole N, Raymond S, Cygler M (2003) J Struct Funct Genomics 4:47
Article PubMed CAS Google Scholar
Petrey D, Honig B (2005) Mol Cell 20:811
Article PubMed CAS Google Scholar
Pieper U, Eswar N, Davis FP, Braberg H, Madhusudhan MS, Rossi A, Marti-Renom M, Karchin R, Webb BM, Eramian D, Shen M-Y, Kelly L, Melo F, Sali A (2006) Nucleic Acids Res 34:D291
Article PubMed CAS Google Scholar
Serre V, Penverne B, Souciet JL, Potier S, Guy H, Evans D, Vicart P, Herve G (2004) BMC Biochem 5:6
Article PubMed Google Scholar
Service R (2005) Science 307:1554
Article PubMed CAS Google Scholar
Sippl MJ (1993) Proteins Struct Funct Genet 17:355
Article CAS Google Scholar
Stark A, Shkumatov A, Russell RB (2004) Structure (Camb) 12:1405
Article CAS Google Scholar
Stein L (2001) Nat Rev Genet 2:493
Article PubMed CAS Google Scholar
Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) J Mol Biol 348:1235
Article PubMed CAS Google Scholar
Tsoka S, Ouzounis CA (2000) FEBS Letters 480:42
Article PubMed CAS Google Scholar
Vitkup D, Melamud E, Moult J, Sander C (2001) Nat Struct Biol 8:559
Article PubMed CAS Google Scholar
Walian P, Cross T, Jap B (2004) Genome Biol 5:215
Article PubMed Google Scholar
Wallace IM, Blackshields G, Higgins DG (2005) Curr Opin Struct Biol 15:261
Article PubMed CAS Google Scholar
Wodak SJ, Janin J (2002) Adv Protein Chem 61:9
Article PubMed Google Scholar
Xie L, Bourne PE (2005) PLoS Comput Biol 1:e31
Article PubMed CAS Google Scholar
Yamaguchi A, Iwadate M, Suzuki E, Yura K, Kawakita S, Umeyama H, Go M (2003) Nucleic Acids Res 31:463
Article PubMed CAS Google Scholar
Yokoyama S, Hirota H, Kigawa T, Yabuki T, Shirouzu M, Terada T, Ito Y, Matsuo Y, Kuroda Y, Nishimura Y, Kyogoku Y, Miki K, Masui R, Kuramitsu S (2000) Nat Struct Biol 7(Suppl):943
Article PubMed CAS Google Scholar

Download references

Acknowledgements

This work was supported by a Grant-in-Aid for Scientific Research on Priority Area (C), Genome Information Science from the Ministry of Education, Culture, Sports, Science, and Technology of Japan.

Author information

Authors and Affiliations

Quantum Bioinformatics Team, Center for Computational Science and Engineering, Japan Atomic Energy Agency, 8-1 Umemidai, Kizu-cho, Souraku-gun, Kyoto, 619-0215, Japan
Kei Yura
Department of Bio-Science, Faculty of Bio-Science, Nagahama Institute of Bio-Science and Technology, 1266, Tamura-cho, Nagahama, Shiga, 526-0829, Japan
Akihiro Yamaguchi & Mitiko Go
Ochanomizu University, 2-1-1 Otsuka, Bunkyo-ku, Tokyo, 112-8610, Japan
Mitiko Go

Authors

Kei Yura
View author publications
You can also search for this author in PubMed Google Scholar
Akihiro Yamaguchi
View author publications
You can also search for this author in PubMed Google Scholar
Mitiko Go
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Kei Yura.

Rights and permissions

Open Access This is an open access article distributed under the terms of the Creative Commons Attribution Noncommercial License ( https://creativecommons.org/licenses/by-nc/2.0 ), which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.

Reprints and permissions

About this article

Cite this article

Yura, K., Yamaguchi, A. & Go, M. Coverage of whole proteome by structural genomics observed through protein homology modeling database. J Struct Funct Genomics 7, 65–76 (2006). https://doi.org/10.1007/s10969-006-9010-3

Download citation

Received: 11 May 2006
Accepted: 08 August 2006
Published: 05 December 2006
Issue Date: June 2006
DOI: https://doi.org/10.1007/s10969-006-9010-3

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Coverage of whole proteome by structural genomics observed through protein homology modeling database

Abstract

Similar content being viewed by others

A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome

A large-scale evaluation of algorithms to calculate average nucleotide identity

Protein Data Bank (PDB): The Single Global Macromolecular Structure Archive

Introduction