Homology modelling and spectroscopy, a never-ending love story
- First Online:
- Cite this article as:
- Venselaar, H., Joosten, R.P., Vroling, B. et al. Eur Biophys J (2010) 39: 551. doi:10.1007/s00249-009-0531-0
- 960 Downloads
Homology modelling is normally the technique of choice when experimental structure data are not available but three-dimensional coordinates are needed, for example, to aid with detailed interpretation of results of spectroscopic studies. Herein, the state of the art of homology modelling will be described in the light of a series of recent developments, and an overview will be given of the problems and opportunities encountered in this field. The major topic, the accuracy and precision of homology models, will be discussed extensively due to its influence on the reliability of conclusions drawn from the combination of homology models and spectroscopic data. Three real-world examples will illustrate how both homology modelling and spectroscopy can be beneficial for (bio)medical research.
Knowledge of the three-dimensional structure of proteins is a prerequisite for much research in fields as diverse as protein engineering, human genetics and drug design. Only two spectroscopic techniques, nuclear magnetic resonance (NMR) and X-ray, can produce high-resolution three-dimensional coordinates of macromolecules. Most other spectroscopic techniques either add information to such three-dimensional coordinates, or require these coordinates for detailed interpretation of their results. NMR and X-ray are very elaborate techniques, and worldwide only about 30 protein structures are solved per day. In the time needed to read the above abstract, on the other hand, about 50 sequences were determined (worldwide) and deposited in international, freely and easily accessible sequence databases. Consequently, the necessity for homology modelling is only increasing.
In its most elementary form, homology modelling involves calculating the structure of a protein for which only the sequence is known using its alignment with a homologous protein for which the structure is known.
The first homology modelling articles were published as early as the late 1970s (Greer 1980), and since then we have kept using and improving the same concepts described in those ground-breaking articles. The process starts with the detection of a suitable template; an alignment is produced; insertions, deletions and residue substitutions are performed; the model is optimized; and since the late 1990s there is consensus that structure validation is needed to detect the unavoidable errors in the final model.
In the early 1990s many homology models were built (and unfortunately also published; also in EBJ) just for the sake of modelling, but since the mid 1990s homology models are considered tools that can aid with the design of experiments and with the interpretation of their results, although occasionally things can still go very wrong in the literature.
The following series of short paragraphs illustrate the mutual relation between spectroscopy and homology modelling. Most examples are drawn from the GPCR research field. These examples are just illustrations and neither imply a judgment nor pretend completeness.
Exposed residue labelling
In two studies Davison and Findlay (1986a, b) identified residues that were exposed to the membrane environment or the retinal binding site, respectively, by labeling opsin with photoactivated l-azido-4-iodobenzene. This study was carried out long before the first crystal structure of any GPCR became available, and determining which residues were labelled allowed Davison and Findlay to get a more complete picture of the three-dimensional organisation of the opsin molecule.
Site-directed spin labelling
The group of Khorana used site-directed spin labelling to analyse the structure and light-dependent changes of part of GPCRs. The electron paramagnetic resonance (EPR) spectrum of each spin-labelled mutant was analysed in terms of residue accessibility and mobility. In the article where they describe the region extending from helix VII to the palmitoylation sites in the rhodopsin molecule (Altenbach et al. 1999; Cai et al. 1999) they concluded that this region had extensive tertiary interactions. After Oliveira et al. (1999)—correctly—modelled this region as a helix that runs parallel to the membrane, the interpretation of the EPR data could be extended significantly, indicating how modelling can help interpret spectroscopic measurements.
Fluorescence resonance energy transfer
Turcatti et al. (1996) studied ligand–receptor interactions in the neurokinin-2 receptor (NK2) using fluorescence resonance energy transfer (FRET). A fluorescent unnatural amino acid was introduced at known sites into NK2. Inter-molecular distances were determined by measuring the fluorescence resonance energy transfer between the fluorescent unnatural amino acids and a fluorescently labelled NK2 heptapeptide antagonist. A similar approach was used to measure distances between the cholecystokinin receptor and a natural agonist (Harikumar et al. 2002) and between the secretin receptor and secretin analogues (Harikumar et al. 2007). Distances obtained were used as constraints to improve models for ligand–receptor interactions. The NK2 results were interpreted in terms of an obviously very poor bacteriorhodopsin structure. Looking at their results more than 10 years later, and with four GPCR structures at hand, we can see that they located the NK2 ligand largely at the correct place. This tells us how to trust or distrust the secretin FRET results and illustrates how spectroscopy can help improve modelling.
Fourier-transform infrared spectroscopy
Molecular models of rhodopsin based upon electron density projection maps that were constructed before the first crystal structure became available proposed a specific interaction between transmembrane (TM) helices III and V, which appeared to be mediated by amino acid residues Glu122 and His211 on TM helices III and V, respectively. Beck et al. (1998) used a combination of site-directed mutagenesis and Fourier-transform infrared spectroscopy (FTIR) to validate this hypothesis.
Small-angle scattering has occasionally been used to assist with homology modelling of water-soluble proteins. Mascarenhas et al. (1992) studied crotoxin. The sequence identity with a template structure was generally high enough to build a good homology model, but the structure of one large loop remained highly ambiguous. Small-angle neutron scattering data corresponded much better with one of the two models made, thus solving this problem. Comoletti et al. (2007) studied the structure of the neuroligins and their complex with neurexin using small-angle neutron scattering and small-angle X-ray scattering. A high-resolution structure of the neurexin was available but no structure was available for the neuroligins. However, the neuroligins have significant sequence similarity with acetylcholine esterase, making it possible to build a homology model. Scattering from neuroligin constructs was similar to that previously obtained from acetylcholine esterase structures (Marchot et al. 1996), indicating that the homology model was valid.
Turcatti et al. (1995) studied the NK2 receptor using a number of fluorescent ligands, differing only in the length of the spacer between the fluorescent probe and the peptide ligand. By analyzing the different levels of fluorescence related to the spacer lengths they found that the binding pocket of the NK2 receptor was buried at a depth of 5–10 Å.
Modelling and spectroscopy of the M13 protein
The final example to illustrate the never-ending love story of modelling and spectroscopy was recently reviewed by the Hemminga group. In a beautiful review (Vos et al. 2009) entitled “From ‘I’ to ‘L’ and back again: the odyssey of membrane-bound M13 protein” they illustrate how a large series of spectroscopic techniques have been employed worldwide over a period of more than 20 years to continuously update the structure model of the M13 protein in its membrane-bound form. The (mainly spectroscopic) techniques used during this whole odyssey include NMR, site-specific and other solid-state NMR, X-ray fibre diffraction, cryo-electron microscopy, site-directed spin labelling and site-directed introduction of fluorescence probes, similar to that described above for the GPCR studies, fluorescence energy transfer, site-specific infrared dichroism etc. Throughout this odyssey of Hemminga and others, homology modelling and spectroscopy were both applied. This example nicely illustrates the importance of homology modelling for spectroscopy, and vice versa. In some cases the spectroscopic results triggered the need to analyse the model, while in other cases the model suggested the spectroscopic experiments, and in some cases both went hand in hand. Throughout this odyssey, the model improved and the spectroscopy became more sophisticated.
The M13 odyssey also shows that spectroscopic techniques developed over the years, allowing for continuously more precise and more accurate measurements. At this same time homology modelling improved too. In the next paragraphs we will discuss the latest developments in homology modelling and its remaining unsolved problems. We will also illustrate the usefulness of homology models, sometimes in combination with spectroscopic techniques, with a few real-world examples.
Method overview: the eight steps of homology modelling
Step 1: Template recognition and initial alignment
Traditionally, the modelling template is found by performing a BLAST (Altschul et al. 1990) search against the Protein Data Bank (PDB) (Berman et al. 2003). This approach is often successful when the query is highly similar to structures in the database. In contrast, templates that are close to the possible homology modelling threshold are harder to find or may even remain undetected (Sander and Schneider 1991). The development of PSI-BLAST (Altschul et al. 1997) and fold-recognition methods (e.g. Jones 1999) improved the detection of these difficult-to-find templates because these methods use a profile instead of a single sequence to search the database. Furthermore, the growing number of structures collected in the PDB makes it every year easier to find a homologous one.
Finding the best possible template is not limited to searching the PDB with (PSI-)BLAST. There may be several candidate templates with similar sequence identities to the query. In that case, the optimal template must be selected based on other criteria. The X-ray resolution is, although frequently used as such, only a limited measure of structure quality because it says something about the experimental data, not about the quality of the structure model. I.e. with 1.8-Å data one can typically make a better structure model than with 2.2-Å data, but whether or not this better structure model is actually made depends on the crystallographer and the software used.
The crystallographic residual, the so-called R-factor, says something about the correlation of the structure model with the experimental data and therefore seems more indicative than the X-ray resolution alone. Unfortunately, acceptable and even seemingly encouraging R-factors can be attained by adding more parameters to the structure models, effectively over-fitting/over-refining the model (Brändén and Jones 1990). This problem was solved by the introduction of the free R-factor (Brünger 1992), which is much more robust against over-fitting because the value is calculated only with the fraction of the X-ray data that was not used to build the structure model. Therefore, R-free can be seen as a description of how well the structure model predicts an ‘independent’ measurement.
While very robust, the free R-factor is a global indicator: it describes the structure model as a whole, but not a particular part of the protein that may interest us. A local measure of fit with the experimental data is needed, for instance, the real-space R-factor (Jones et al. 1991). For PDB entries these values can be obtained from the electron density server (EDS) (Kleywegt et al. 2004).
So, with the X-ray resolution, the (free) R-factor and the real-space R-factor, one can select a proper template from the (PSI-)BLAST results. That is, one can select a template that corresponds well with the X-ray experiment. A more in-depth analysis of the template structure is needed to see whether it also corresponds to our current knowledge of protein structures. Structure validation scores such as the Ramachandran Z-score (Hooft et al. 1997), the fraction of Ramachandran plot outliers (Laskowski et al. 1993), side-chain rotamer normality scores (Hooft et al. 1996a), residue packing scores (Vriend and Sander 1993), hydrogen-bond network quality (Hooft et al. 1996b) and many others are used to obtain insight into the geometric quality of the template structure. Most of these validation scores, both global and local, can be obtained via the PDB and the linked databanks.
Another possible step in template selection is optimisation of the template before the actual modelling. We have recently shown that validation scores such as the Ramachandran Z-score and the number of atomic clashes (bumps) can be improved by a fully automated re-refinement of the PDB entry with its original experimental data (Joosten et al. 2009). In addition, the crystallographic R-factor, or rather the free R-factor, is also improved by this re-refinement. This optimisation is particularly useful for templates that will be used for drug docking studies because their success often depends critically on the quality of the atomic model. The benefit of re-refinement is tightly correlated with sequence identity between the template and the model sequence. That is, any improvement of the atomic coordinates of a residue is lost when this residue (or just its side-chain) has to be rebuilt. Fortunately, even with low sequence identity, there may be regions of the template that are not changed in the modelling process and thus can be improved by re-refinement.
Step 2: Alignment correction
Having identified one or more possible modelling templates using the initial screening described above, more sophisticated methods are needed to arrive at better alignment. Molecular class-specific information systems (MCSISs) can be a great asset in the homology modelling process. MCSISs contain a large amount of heterogeneous data on one particular class of proteins. The GPCRDB (http://www.gpcr.org/7tm/; Horn et al. 2003) is a good example of such a system. It contains sequences, structures, mutation data, ligand binding data and much more. All of this information is used (directly or indirectly) for creating sequence profiles for GPCR (sub-)families. Profile-based alignments (Oliveira et al. 1993) are used to generate alignments that are of significantly higher quality than alignments generated by automatic methods based solely on sequence data. Currently, MCSISs are available for only a small number of protein families and therefore in most cases other sequence alignment tools are needed.
Many programs are available to align a number of related sequences, for example, CLUSTALW (Thompson et al. 1994). The resulting multiple sequence alignment implicitly contains a lot of additional information. For example, if at a certain position only exchanges between hydrophobic residues are observed, it is highly likely that this residue is buried. Multiple sequence alignments are also useful to place deletions or insertions in areas where the sequences are strongly divergent. To consider this knowledge during the alignment, one uses the multiple sequence alignment to derive position-specific scoring matrices, which are also called profiles (e.g. Taylor 1986; Dodge et al. 1998). In recent years, new programs such as MUSCLE and T-Coffee have been developed that use these profiles to generate and refine the multiple sequence alignments (Edgar 2004; Notredame et al. 2000).
Step 3: Backbone generation
Step 4: Loop modelling
Step 5: Side-chain modelling
Step 6: Model optimisation
Step 7: Model validation
Step 8: Iteration
If the model is not good enough, (part of) the modelling process has to be repeated. For instance, wrong side-chain conformations can be improved by iterating the process from step 5 onwards. Sometimes, this iteration step means that one has to start the modelling process all over again using another template or alignment. Alternatively, one can start several modelling processes using different templates. The resulting models can be combined in the end to produce a hybrid model that consists of the strongest points of each separate model.
Unsolved problems and future directions in homology modelling
While many scientific disciplines face huge difficulties when trying to experimentally validate theoretical predictions, protein modelling is in a fortunate situation: since 1994, the biennial Critical Assessment of Structure Prediction (CASP) contests (Moult et al. 2007) have provided an ideal opportunity to evaluate the accuracy of today’s many protein structure prediction methods. During each CASP season (lasting about 4 months once every 2 years), about 200 research groups try to predict the structures of ~100 proteins, the CASP targets. The target sequences are provided to CASP by structural biology laboratories just before the corresponding structures are solved. The predictions are thus true blind predictions, allowing performance to be measured on realistic test cases, locating areas of progress as well as still unsolved problems.
The only way to handle these difficult cases is to apply more general ab initio folding algorithms, which do not depend on template structures, but try to simulate the complete folding process from the stretched-out conformation. As it turns out, this one-algorithm-fits-all approach is currently the most successful one at CASP (Chivian et al. 2003; Pandit et al. 2006): if available, it uses known templates (or fragments thereof) only to guide the search, but does not depend on them. As a side-effect, this allows hybrid models to be built, combining the best parts from multiple templates.
Despite these encouraging developments, the protein folding problem is far from solved. The best models are still built by those who got the alignment right in the first place, which unfortunately implies that structural diversity is often missed: one cannot yet ignore the difficult-to-align regions and simply predict them with ab initio folding instead. The sequence alignment problem will thus remain an active research field for years to come.
Noteworthy progress has been made with model optimisation to bridge the structural gap between initial model and target. While in the early days of CASP, predictors were well advised to keep the backbone of their model fixed (the frozen-core approach), simply because the danger of messing up the model was too large, the situation is quite different today: force field accuracy (Krieger et al. 2004) and sampling efficiency (Misura et al. 2006) have improved to a level that allows well-performing methods such as Modeller-CSA (Joo et al. 2008), Rosetta (Chivian et al. 2003), undertaker (Vriend 1990) and YASARA (Krieger et al. 2009) to free all atoms during the refinement, often moving models considerably closer to the target.
While homology modelling currently focusses on the protein in a model, other entities, i.e. carbohydrates, small molecules and ions, also make up important parts of certain proteins and protein complexes; for instance, zinc atoms in so-called zinc fingers are important for the stability of the protein, and the common protein haemoglobin would be useless without its heme groups and the iron atoms therein. Carbohydrates in glycoproteins perform numerous functions, ranging from providing stability to signalling and labelling for intracellular transport (Lütteke 2009). The many roles of non-protein entities make it obvious that homology modelling should look beyond the protein. A complete model should thus be more than a three-dimensional (3D) representation of an amino acid sequence. One major challenge for homology modelling is recognising binding sites for non-protein entities.
Drug docking software (e.g. Rarey et al. 1996; Nabuurs et al. 2007) can be used to detect the binding sites of compounds such as heme groups or coenzymes. However, relevant biological information is needed to select compounds that may be bound to the protein. Copying the binding site from the template structure is the simplest method, but does not work for ab initio folding models. For such models, spectroscopic analysis of the protein can provide insight into which compounds are bound. This approach is not limited to homology modelling; X-ray crystallography can also benefit from spectroscopic analysis of a protein to identify a bound compound (Chen et al. 2002).
Incorporating ions can be an additional step of the modelling process. Nayal and Di Cera (1996) have suggested a method to detect sodium binding sites in protein structures which can be extended to detect various other ion binding sites. Of course, any additional experimental data can guide this ion-site detection process. Especially tightly bound functional ions that co-purify with the protein can be detected by means of spectroscopic analysis. A significant number of PDB files have bound ions or water molecules that were erroneously assigned. We have observed H20, and Na+, K+, Mg2+, Ca2+ and NH4+ ions that should actually be one of the others in this list. This is the result of X-ray spectroscopy having difficulties distinguishing between H2O, NH4+, Na+ and Mg2+ because these entities have equally many electrons, as do K+ and Ca2+.
The power of force-field-based model optimisation methods can be significantly reduced when such problems include a difference in the ionic charge. It is therefore very important to (experimentally) validate the ions in template structures when these are important for the final homology model.
Results and discussion: other roles for homology modelling
In the 1990s most articles that included homology modelling described just how the model of one protein was constructed, and ended with the ominous sentence “…this model will help us perform our research on this intriguing protein”, after which the group would start working on something else. Here, we will illustrate the importance of homology modelling for the study of inheritable diseases, but the value of models has been amply illustrated in fields ranging from drug design to laundry powder enzyme engineering, from validating experimental structures to the design of humanized antibodies, and from mutation analysis to intelligent experimental design in many spectroscopic research projects.
In the following examples building the homology model was not the ultimate goal but one of the tools used to gain more information about a mutation, a disease or a process in the human body. These examples prove that homology models can be of great use in the (bio)medical field.
Modelling of the LRTOMT-COMT domain
In a study of non-syndromic deafness four pathogenic mutations were found in an as-yet uncharacterized gene which codes for two different proteins, called LRTOMT1 and LRTOMT2 (Ahmed et al. 2008). Three of these mutations were mis-sense mutations located in the catechol-O-methyltransferase (COMT, EC 184.108.40.206) domain of LRTOMT2; one introduced a stop codon causing the loss of a large fraction of the protein. The COMT domain catalyses the transfer of a methyl group from S-adenosyl-l-methionine (Ado-Met) to a hydroxyl group of catechol. No structure for LRTOMT2 was known, so we needed a homology model to study the mis-sense mutations in more detail.
Modelling of the ligand-binding domain of ESRRB
Sequence analysis revealed four mis-sense mutations of the estrogen-related receptor beta (ESRRB) gene leading to autosomal-recessive non-syndromic hearing impairment (Collin et al. 2008). Experimental results indicated that ESRRB is essential for inner-ear development and function. ESRRB encodes the estrogen-related receptor protein beta, a member of the nuclear hormone receptor (NHR) family. In general, members of this family share a zinc finger C4 DNA-binding domain at their N-terminus and a ligand-binding domain that is located near the C-terminus. The ligand-binding domain of nuclear hormone receptors is a well-conserved and highly organized structure containing 12 alpha-helices. Three mutations (p.L320P, p.V342L and p.L347P) were located within this ligand-binding domain. The fourth mutation was located in the DNA binding domain and will not be discussed here.
In summary, the molecular modelling data predicts that mutations of ESRRB will result in conformational changes near the substituted amino acids or decreased helix stability and are therefore likely affect the stability and function of the complete ligand-binding domain.
Functional states of alcohol oxidase
Homology modelling will always be needed because it is impossible to solve the three-dimensional structure for each determined sequence. An increasing number of scientists are now using protein structures for mutational analysis and experimental design. The process of homology modelling has improved dramatically over the years, but there are still many problems to solve. It is reassuring for homology modellers that modelling problems can often be solved using spectroscopic techniques. Spectroscopists, on the other hand, normally need homology models to fully harvest the results from their spectroscopic studies. This interplay brings us back to the title of this article: homology modelling and spectroscopy are indeed a never-ending love story.
This work was part of the BioRange programme of the Netherlands Bioinformatics Centre (NBIC), which is supported by a BSIK grant through the Netherlands Genomics Initiative (NGI). TIPharma is thanked for support. This work was funded in part by the EU project EMBRACE Grid which is funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LUNG-CT-2004-512092. This work was also supported by the BioSapiens Network of Excellence project. The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area ‘Life sciences, genomics and biotechnology for health’, contract number LSHG-CT-2003-503265. G.V. wishes to thank M.A. Hemminga for being a warm-hearted friend and a great Ph.D. supervisor.
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.