Introduction

Genome sequencing projects provided a huge number of amino acid sequences without functional information (Stein 2001). To discover biological functions of those proteins, both computational predictions and biochemical experiments are necessary (Tsoka and Ouzounis 2000). Most of the proteins perform functions after forming specific 3D structures, and therefore protein 3D structure is one of the most valuable sources of information to predict protein function (Domingues et al. 2000; Xie and Bourne 2005). Protein function prediction based on 3D structures, especially protein surface structures, with evolutionary and/or physicochemical characteristics have been extensively studied (Lichtarge and Sowa 2002; Campbell et al. 2003; Kinoshita and Nakamura 2003; Laskowski et al. 2003; Ota et al. 2003; Pieper et al. 2006). However, determining protein structures of all the function-unknown proteins for applying these types of study is not practical.

Proteins are classified into a large number of ‘families’ based on the amino acid sequence similarity (Dayhoff 1972), and proteins with similar amino acid sequences are known to have similar 3D structures (Chothia and Lesk 1986), all because the proteins in a family are evolutionary related (Doolittle 1995). Once we have 3D structure of at least one of the proteins in a family, then 3D structures of other proteins in the same family can be computationally deduced by ‘homology modeling’ (Burley 2000; Baker and Sali 2001). Based on this logic, structural genomics (SG) projects, which are to determine protein 3D structures of representatives for each family have been proposed and launched (Vitkup et al. 2001; Brenner 2000; Burley and Bonnano 2002). In homology modeling, corresponding residues between an amino acid sequence of structure unknown protein (target) and that of 3D structure known protein (template) in the same family are determined by sequence alignment and every residue in a template protein is replaced by that in a target protein (Marti-Renoma et al. 2000).

SG projects have been providing new protein structures (Todd et al. 2005; Xie and Bourne 2005; Chandonia and Brenner 2006). Protein Data Bank (PDB) (Berman et al. 2000) now contains more than 390 3D structures for function unknown or hypothetical proteins (Stark et al. 2004). Protein function predictions based on 3D structures determined by SG projects are also in progress (Goldsmith-Fischman and Honig 2003; Liu et al. 2005; Petrey and Honig 2005). There are some projects that focus on a specific species and try to determine the 3D structures of whole proteins encoded in the genome of the species (Kim 2000; Yokoyama et al. 2000; Kim et al. 2003). Those projects provide a considerable number of 3D structures in a single protein family. This results in providing multiple templates for a single protein family and it can improve quality of homology modeling (Contreras-Moreira et al. 2003).

We have developed FAMSBASE; a database for homology modeling 3D structures of whole proteins predicted on whole genome sequences, since 2001 (Yamaguchi et al. 2003; http://daisy.nagahama-i-bio.ac.jp/Famsbase/). FAMSBASE contains results of homology modeling by FAMS, a full automatic modeling software (Ogata and Umeyama 2000). Sequence alignments between whole ORFs and proteins in PDB are based on GTOP (Kawabata et al. 2002).

We report here the update of the database including differences in the amount of structural data from the previous version, estimation of the time that whole ORFs predicted out of genome sequences are covered by homology modeling 3D structures and upcoming issues for utilizing those modeled structures.

Methods

Data update of FAMSBASE

Correspondence between ORFs derived from whole genome sequences and protein amino acid sequences whose 3D structures are known is provided by GTOP database (Kawabata et al. 2002). The update in May 2005 of FAMSBASE is based on February 2004 version of GTOP. Protein 3D structures in PDB by November 2003 are used for homology modeling templates. FAMS (Ogata and Umeyama 2000) is applied by Umeyama et al. to pair-wise alignments between a predicted ORF sequence and an amino acid sequence with known 3D structure, and a 3D structure is modeled. All the results are stored in FAMSBASE.

Assessing annual difference of data in FASBASE

Based on the amount of data in FAMSBASE in 2001 and the amount of increase in the following years, a due year for whole proteome 3D structure models is estimated. Estimation is done residue-wise, not ORF-wise, since modeled structures in FAMSBASE are often limited to structural domains. In this report, structural domains refer to SCOP domains (Andreeva et al. 2004). All ORFs predicted out of genome sequences are divided into soluble and membrane proteins. The division is carried out by SOSUI (Hirokawa et al. 1998), and a protein with one or more transmembrane regions is classified into a membrane protein. The number of residues of whole soluble proteins encoded in the genome sequence (G) of species i is denoted as S\(_{{\rm G}i}\), and the number of residues of whole membrane proteins is denoted as M\(_{{\rm G}i}\). The number of residues included in modeled 3D structures of soluble and membrane proteins are denoted as S3\(_{{{\rm G}i}}\) and M3\(_{{{\rm G}i}}\), respectively. For a certain genome Gi, the coverage of modeled 3D structures in whole soluble proteins is then S3\(_{{\rm G}i}\)/S \(_{{\rm G}i}\times100\) and the coverage for whole membrane proteins is M3\(_{{\rm G}i}\)/M\(_{{\rm G}i}\times100\). The coverage is summarized in different kingdoms of life as in the following equations;

$$ \begin{aligned} \hbox{coverage of soluble protein }&=\sum\limits_{i\in {\rm kingdom}} {{\rm S3}_{{\rm G}i}} /\sum\limits_{i\in {\rm kingdom}} {{\rm S}_{{\rm G}i}}\times 100,\\ \hbox{coverage of membrane protein }&=\sum\limits_ {i\in {\rm kingdom}} {{\rm M3}_{{\rm G}i}} /\sum\limits_{i\in {\rm kingdom}} {{\rm M}_{{\rm G}i}}\times100. \end{aligned} $$

Both figures are calculated based on the data at the different times of FAMSBASE update, gradients in figures are then calculated, and the figures are extrapolated up to the year that coverage reaches to 100.

It is getting to be known that not all ORFs assume stable 3D structures. Some parts of ORFs are considered to be natively disordered (Oldfield et al. 2005; Dyson and Wright 2005). Hence it is unlikely that coverage by homology modeling reaches to 100. We, therefore, estimate disordered regions in whole ORFs by DisEMBL (Linding et al. 2003) and omit these disordered regions from the calculation.

Non-overlap multiple model structures in single ORFs

Modeled 3D structures in FAMSBASE are often limited to structural domains. To find an ORF of which most of the entire 3D structure is modeled in pieces of structural domains, an ORF covered by non-overlapping three or more modeled 3D structures in eukaryotic genome is surveyed based on the following criteria; (1) 70% or more residues in the ORF are included in one of the modeled 3D structures, (2) the ORF contains three or more non-overlapping modeled structures, and (3) the sequence identity between a template protein and a target domain is no less than 25%. At the time of FAMSBASE building, five model structures are at most built for each ORF (Yamaguchi et al. 2003). Therefore, the expected number of modeled structures in the above criteria is between three and five.

Prediction of domain interfaces

The 3D structure in pieces for a single ORF needs to be assembled to model the entire 3D structure. For this procedure, a prediction of domain interfaces of each 3D structure is needed. A hydrophobicity index based on protein 3D structures is built for domain interface prediction. Hydrophobicity of amino acid residue is measured by buriedness of a residue inside the protein 3D structures. A representative 4,529 chains in PDB among which sequence identities are less than 30% were selected and solvent accessibility of each residue is calculated on a monomer state. For each amino acid residue type i (i = 1,...,20), the number of residue with accessibility no less than b (=0.0 − 1.0) is counted (S b,i ). Database derived hydrophobicity index (I b,i ) is obtained by;

$$ I_{b,i} =-\log_2 \left({\left({S_{b,i}/\sum\limits_i{S_{b,i}} }\right)/\left({S_{0,i}/\sum\limits_i{S_{0,i}}}\right)}\right). $$

b is set to 0.15 to maximize the difference of I b,i among different residues. The index has good correlation with Kyte and Doolittle hydrophobicity index (Kyte and Doolittle 1982). The index I 0.15,i is assigned to every residue on the surface (accessibility no less than 0.15) of a modeled 3D structure. The hydrophobicity of each residue on a surface of a protein is then obtained by averaging the assigned values of residues within 7.0 Å from the residue in concern. A hydrophobic patch on the surface of the modeled structure is found as a cluster of surface residues with the hydrophobicity no less than 0.0.

Results and discussion

Coverage of whole protein space by homology modeling

The latest update of FAMSBASE at May 2005 uses protein 3D structures deposited to PDB by the end of Nov. 2003 and ORFs predicted from genome sequences deposited by February 2004 (http://daisy.nagahama-i-bio.ac.jp/ Famsbase/). The latest FAMSBASE contains 1,396,272 modeled 3D structures of 368,724 ORFs derived from 17 archaebacterial, 130 eubacterial, 18 eukaryotic and 111 phage genomes; in total 276 genomes. Five models at maximum are built for each ORF in FAMSBASE. Those five models are the structure for the same or different regions in the ORF. When multiple models are built for the same region of ORF, we can evaluate the reliability of the model. When the model based on different templates have the similar 3D structures, then the 3D structure would be reliable. When the structures are different, the modeled structure would be less reliable. We further test the quality of modeled 3D structure by ProsaII (Sippl 1993) and find that about 72% of the modeled 3D structures are energetically ranked as number one and comparable to experimentally determined 3D structures. Some of the structures that fail the test are structures of a part of a large protein, mostly structural domains of large proteins. It is difficult to assess the quality of this type of domain structures, because interfaces of the domain for other parts of the protein are exposed in the modeled structures. Tendency of amino acid residue appearance in the interface is supposed to be different from that at the surface as we discuss down below.

In the genome of 276 species, 734,193 ORFs are predicted. Therefore, in FAMSBASE, 3D structure of 50% (368,724/734,193) of ORFs have been built and stored (Table 1). These are about 47% of ORFs in archaebacterial genomes, about 52% in eubacterial genomes and about 49% of eukaryotic genomes.

Table 1 Number of ORFs and those with modeled 3D structures in 276 genomes

When a modeled 3D structure is counted based on the number of amino acid residues, not on the number of ORFs, a different aspect emerges. Figure 1 shows the percentage of amino acid residues per ORF included in the modeled structures. ORFs without a modeled structure are omitted. Of archaebacterial and eubacterial genomes, in 60% of ORFs, more than 80% of the residues are included in modeled 3D structures, however, of eukaryotic genomes, only in 30% of ORFs, more than 80% of the residues are included (red and blue sections in Fig. 1). The proportion of residues in modeled 3D structure can be measured by the number of residues in a typical structural domain as shown in SCOP (Andreeva et al. 2004). The average size of protein domain is around 100–150 residues (Copley et al. 2002). In ORFs with modeled structures, a continuous region of residues with one domain or more remains as structure unknown in only about 18% of ORFs of archaebacterial and eubacterial genomes, whereas in about 60% of ORFs of eukaryotic genomes, the regions with one domain or more remain as structure unknown.

Fig. 1
figure 1

Percentage of amino acid residues included in modeled 3D structures in each ORF is classified by 10% bins and shown in pie charts. ORFs without a modeled structure are not included. A number of ORFs with modeled structures and an average length of the ORFs are shown at the center of each pie chart. Sections bordered by thick black lines indicate that the unmodeled region in the ORF is no less than the size of a domain (about 150 residues)

Annual difference of model structures

In FAMSBASE of 2001, 38% of amino acid residues in all ORFs of archaebacterial and 40% of eubacterial genomes were included in modeled 3D structures (Yamaguchi et al. 2003). In the current update of FAMSBASE based on data by around 2004, 42% of amino acid residues in all ORFs in archaebacterial and 46% of eubacterial genomes are included in modeled structures. In eukaryotic genomes, 24% of amino acid residues in 2003, and 26% in 2004 are included in modeled 3D structures. Those figures can be used to estimate the time when modeled 3D structures of whole proteins predicted from genomes are obtained. The estimation for the time obtaining the whole soluble and membrane proteins are treated separately, because the speed of structure determination for soluble and membrane proteins seems to differ. The assumption for the estimation is that the speed for structure determination would stay the same and no new protein family would appear.

For eubacterial genomes, 72.6% of residues in whole ORFs are predicted by SOSUI (Hirokawa et al. 1998) to encode soluble proteins and 27.4% to encode membrane proteins. This ratio is not so different from the previous prediction by Krogh et al. (2001). Of about 40% of whole eubacterial ORF that were with modeled 3D structures in 2001, approximately 90% were soluble proteins and 10% were membrane proteins. Therefore, about 50% (=0.40 × 0.90/0.726) of the whole soluble proteins were modeled. Of the whole membrane proteins in eubacterial genome, about 15% (=0.40 × 0.10/0.274) were modeled. In 2004, those figures are grown to 57% and 19%, respectively. In eubacterial whole ORFs, about 19.9% of amino acid residues are predicted to be included in disordered region by DisEMBL (Linding et al. 2003). Some of these regions are included in the modeled structures. These regions are either incorrectly predicted regions or incorrectly modeled regions. Assuming that the disordered regions without modeled 3D structures are correctly predicted, 10.8% of amino acid residues in soluble proteins were disordered and we would never obtain 3D structures of those regions. Then, by extrapolating the coverage of soluble proteins up to 89.2% (100–10.8) with the current growth rate, we can estimate that, by the year 2017, whole soluble proteins encoded in eubacterial genomes can be modeled (Fig. 2). Whole soluble proteins of archaebacterial genome can be modeled by 2021 and those of eukaryotic genomes, by 2031.

Fig. 2
figure 2

Annual differences of modeled structures classified by kingdoms of life. The percentage is the number of amino acid residues included in modeled structures over the whole number of residues in predicted sequences for soluble and membrane proteins in each kingdom. (S) stands for soluble proteins and (M) stands for membrane proteins. Some of the residues are predicted to be in a disordered region. The percentage of residues in disordered regions is shown at the top

Orengo et al. (1999) showed percentage of ORFs with protein 3D structures as between 30 and 46% in 1999. The genome sequences known by 1999 were mostly derived from prokaryotic species and the known protein 3D structures were mostly soluble proteins. Therefore, the figures they presented in 1999 should correspond to the figures of archaebacterial and eubacterial soluble proteins. When we extrapolate the figures of archaebacterial and eubacterial soluble proteins to the past in Fig. 2, the figures are around 40% in 1999, indicating that their figures approximately lie on the extrapolated lines.

The current estimation indicates that we will obtain 3D structures of whole soluble proteins of eubacteria in 11 years and archaebacteria in 15 years. This estimation does not take into account the acceleration of structure determination speed by automation (McPherson 2004; DeLucas et al. 2005), which makes the due days closer to the present. For membrane proteins, speed of structure determination has been drastically accelerated by recent technical innovations (Kyogoku et al. 2003; Lundstrom 2004; Walian et al. 2004; Dobrovetsky et al. 2005), and therefore we will not linearly extrapolate the present status to estimate the due day for membrane proteins.

Frequency of template structure in use

When the template 3D structures used in FAMSBASE are classified by SCOP superfamily, which is a group of proteins that have low sequence identities but whose structural and functional features suggest that a common evolutionary origin is probable (Lo Conte et al. 2002), and frequencies of superfamilies in use are counted, ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily is found to be the most frequent one; 7,532 times (about 12%) in whole archaebacterial model structures, 77,806 (about 10%) in eubacterial structures and 35,468 times (about 6%) in eukaryotic structures. The templates that follow in frequency in archaebacterial and eubacterial protein structures are ‘NAD(P)-binding Rossmann fold domains’, ‘4Fe–4S ferredxin’, and ‘PLP-dependent transferases’ superfamilies. In eukaryotic protein structures, ‘protein-kinase’, ‘immunoglobulin’ and ‘C2H2 and C2HC zinc fingers’ superfamilies, which appear specifically in eukaryotic genomes, follow the top.

Differences in distribution of frequency of templates in different kingdoms of life are evident, when frequencies in use of template are plotted in descending order (Fig. 3). In any kingdoms of life, the frequencies of the most and the second most used templates exceed those of the remaining templates. The frequencies of templates in use drops first in archaebacterial protein structures and then in eubacterial protein structures. The descending curve of eukaryotic template frequency is less steep compared with the others, indicating that one template can produce a large number of domain 3D structures in eukaryotic ORFs. In other words, a significant number of proteins encoded in eukaryotic genomes are originated by domain duplication, as Koonin et al. (2000) demonstrated. Superfamilies with the 3D structures and with many copies in eukaryotic genomes, but seldom in prokaryotic genomes are ‘protein kinase-like’, ‘immunoglobin’, ‘RNA-binding domain’, ‘C2H2 and C2HC zinc fingers’, ‘WD40-repeat’, ‘glucocorticoid receptor-like’, ‘homeodomain-like’, ‘PH domain-like’, ‘RING-box’, ‘L domain’, ‘ankyrin repeat’, ‘ARM repeat’, ‘cytochrome P-450’ and ‘EF-hand’ superfamilies. These superfamilies are transcription factors, protein–protein interaction mediators and response factor for toxic substances, mostly known to be unique to eukaryotes.

Fig. 3
figure 3

Frequency of template usage in descending order. Horizontal axis is a template and the vertical axis is a frequency of templates in use. Red line is a template usage in archaebacteria, blue line is eubacteria and green line is eukaryotes

The ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily outnumbering other superfamily in template frequency corresponds to the previous finding that the enzyme is highly frequently used in every kingdom of life (Leipe et al. 2003). When biological functions of these ORFs with the 3D structure of ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily are classified, about half of the proteins are ABC transporters in archaebacterial and eubacterial proteomes, but numbers of G-proteins and motor proteins in eukaryotic proteomes are noticeable (Fig. 4).

Fig. 4
figure 4

Protein family distribution of ‘P-loop containing nucleoside triphosphate hydrolases’ superfamily in each kingdom. In the three pie charts, the section with the same color is a category of the same family except for the white section

In the last two years, new protein structures were determined and contributed to an increase in the number of templates for homology modeling. A part of those template structures are listed in Table 2. A part of those top 15 templates contributed a lot for the growth of modeled 3D structure database. In Table 2, 3D structure derived from SG projects is rare. The ratio of SG products in Table 2 is the same as that in PDB (Editorial Board, Nature Structural & Molecular Biology 2004). As the SG projects in US and Europe have proceeded to phase 2 (Service 2005), SG products are expected to contribute to increase in the number of templates in the near future. The qualities of protein 3D structures, namely, size, resolution, R-factors and so forth, derived from SG projects were compared with those in PDB and no obvious compromise in quality of SG products were found (Todd et al. 2005). The quality of homology modeling based on products of SG projects in the future, therefore, will be expected to be no less than the current quality.

Table 2 Top 15 modeling templates in the newly determined 3D structures between 2002 and 2003

Whole structure and function of proteins from homology modeling of domain structures

Protein function prediction, especially studies on enzyme specificity, based on homology modeling structures is intensively carried out in the field of drug design and related fields (Goldsmith-Fischman and Honig 2003; Kopp and Schwede 2004). Those studies are mostly based on homology modeling of domain structures. As mentioned above, most of the eukaryotic protein structures in FAMSBASE are 3D structures of structural domains, not the entire coding regions (Fig. 1). Protein functional sites are often located at a cleft of domains (Laskowski et al. 1996), and therefore understanding relative location of domains will be a critical issue. Xie and Bourne (2005) and O’Toole et al. (2003) also pointed out this problem and mentioned, “even if all the domains of a multiple-domain query sequence have determined structures, the individual structures will not enable accurate modeling of how they associate together in the structure of the entire proteins (O’Toole et al. 2003).”

Figure 5 shows all eukaryotic ORFs whose 3D structures are mostly modeled in pieces. There are three types of enzymes and four types of cell surface receptors. A protein structure of ENSP00000264705 which is an ORF found in human genome can be modeled based on Escherichia coli carbamoylphosphate synthetase (CPS) and Pyrococcus abyssi asparatate transcarbamoylase (ATC). E. coli CPS is composed of a large subunit and a small subunit. CPS and ATC are the first and the second enzymes, respectively, in pyrimidine biosynthesis pathway. In mammalian genomes, those proteins are coded by a single gene and active in a hexamer form (Serre et al. 2004). Interactions between the large subunit domain and the small subunit domain of human CPS are conjectured to be the same as those between the large and the small subunits of E. coli CPS. N-terminal residues of the large subunit and the C-terminal residues of the small subunit are spatially located close in E. coli CPS, which permits the two chains to be chemically connected without disrupting subunit interfaces. To be active, human CPS should form a hexamer supramolecule and the interfaces for the supramolecule formation should be predicted from the modeled 3D structures. At the moment, the interfaces are unknown.

Fig. 5
figure 5

Eukaryotic ORFs with multiple model structures covering more than 70% of entire protein. In each of the bar representation of proteins, a black box is a region with 3D structure. A name and PDB ID of a template structure and amino acid sequence identity between template and target domains are given below the black box. A yellow box is a putative signal peptide and green box is a putative transmembrane region. Template and modeled structures of ENSMUSP00000019416 were shown on the right side of the figure. Each domain is colored by hydrophobicity. A hydrophilic residue is in green and a hydrophobic residue is in red. A buried residue is in deep blue

ENSMUSP00000019416 is an ORF found in mouse genome and encodes a putative cell surface receptor. The protein is predicted to consist of six consecutive Ig-fold domains. There is a putative transmembrane helix at the C-terminal region of the protein. Two consecutive Ig-fold domains are modeled without overlap, and no pieces of information for relative orientation of three modeled structures have been found. Information of interaction sites of those domains is required to build the entire structure of the protein and to predict a target molecule of this receptor. Computational analyses of domain interfaces and of protein–protein interfaces have been targets for extensive study for a long time, and some general characteristics have been found. One of them is the hydrophobicity of the interfaces (Wodak and Janin 2002). Hydrophobic clusters on the surface of modeled structures of ENSMUSP00000019416 are shown in right side of Fig. 5. One of the template structures, Nkp46 ectodomain, has hydrophilic surface (green) around the C-terminal residues of the domain, however the modeled structure has a hydrophobic surface (orange) at the corresponding area. The other template structure, LIR-1 D1D2, has a hydrophilic surface around the N-terminal residues of the domain, however the modeled structure has a hydrophobic surface at the corresponding area. The surfaces uniquely turned into hydrophobic in modeled structures are close to the residues that are chemically bonded in the target protein, and therefore both of the areas likely form interfaces of the two domains. The modeled structure based on LIR-1 D1D2 domain has another hydrophobic surface around the C-terminal residues, which may interact with CD158j-like domain located at the C-terminal side of the domain.

Accuracy of homology modeling

There are at least three major issues that affect accuracy in homology modeling; the best template selection, accuracy of an amino acid sequence alignment between template and target protein sequences and the accuracy of structure building procedure itself (Contreras-Moreira et al. 2005). Accuracy of the alignment is high, when sequence identity of template and target proteins is higher than 30%, and alignment of proteins with identity less than 30% is known to be less reliable, thereby accuracy of homology modeling deteriorates (Kopp and Schwede 2004). FAMS has been shown to construct relatively accurate model structures, even with low sequence identity between template and target sequences in CAFASP2, the homology modeling competition (Iwadate et al. 2001; Yamaguchi et al. 2003). A distribution of sequence identity between amino acid sequences of template and target proteins in FAMSBASE is shown in Fig. 6. Half of the model structures in FAMSBASE rely on alignments of sequence identity less than 20%. Figure 6 suggests that the current 3D structure database does not contain good enough structures for high quality homology modeling. SG projects will eventually provide better template structures, and improvement in target selection, alignment and modeling methods are also in pursuit to overcome the difficulties in homology modeling (John and Sali 2003; Wallace et al. 2005).

Fig. 6
figure 6

Distribution of sequence identity between template and target amino acid sequences in FAMSBASE

Conclusion

Construction of database of whole genome homology modeling clarified that protein 3D structures of about 50% of the protein coding regions in whole genome can now be modeled. Maintaining the current speed of 3D structure determination, it will take, at most, 11 years to have enough templates to cover whole soluble proteins of eubacterial genomes, and 25 years to cover those of eukaryotic genomes. The current advancement in technologies of protein structure determination is expected to make these due times closer to the present. What we obtain at those times are not the 3D structures of entire proteins, but domain structures in pieces. A homology modeled domain structure is now in use of predicting domain functions, but predicting spatial arrangement of domains in a protein will be an important issue for function prediction.