Protein Structure Modeling with MODELLER
Genome sequencing projects have resulted in a rapid increase in the number of known protein sequences. In contrast, only about one-hundredth of these sequences have been characterized using experimental structure determination methods. Computational protein structure modeling techniques have the potential to bridge this sequence-structure gap. This chapter presents an example that illustrates the use of MODELLER to construct a comparative model for a protein with unknown structure. Automation of similar protocols has resulted in models of useful accuracy for domains in more than half of all known protein sequences.
The function of a protein is determined by its sequence and its three-dimensional (3D) structure. Large-scale genome sequencing projects are providing researchers with millions of protein sequences, from various organisms, at an unprecedented pace. However, the rate of experimental structural characterization of these sequences is limited by the cost, time, and experimental challenges inherent in the structural determination by x-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.
In the absence of experimentally determined structures, computationally derived protein structure models are often valuable for generating testable hypotheses (1). Comparative protein structure modeling has been used to produce reliable structure models for at least one domain in more than half of all known sequences (2). Hence, computational approaches can provide structural information for two orders of magnitude more sequences than experimental methods, and are expected to be increasingly relied upon as the gap between the number of known sequences and the number of experimentally determined structures continues to widen.
MODELLER is a computer program for comparative protein structure modeling (4,5). In the simplest case, the input is an alignment of a sequence to be modeled with the template structure(s), the atomic coordinates of the template(s), and a simple script file. MODELLER then automatically calculates a model containing all non-hydrogen atoms, without any user intervention and within minutes on a desktop computer. Apart from model building, MODELLER can perform auxiliary tasks such as fold-assignment (6), alignment of two protein sequences or their profiles (7,8), multiple alignment of protein sequences and/or structures (9), clustering of sequences and/or structures, and ab initio modeling of loops in protein structures (4).
MODELLER implements comparative protein structure modeling by satisfaction of spatial restraints that include: (1) homology-derived restraints on the distances and dihedral angles in the target sequence, extracted from its alignment with the template structures (5); (2) stereochemical restraints such as bond length and bond angle preferences, obtained from the CHARMM-22 molecular mechanics force-field (10); (3) statistical preferences for dihedral angles and non-bonded interatomic distances, obtained from a representative set of known protein structures (11,12); and (4) optional manually curated restraints, such as those from NMR spectroscopy, rules of secondary structure packing, cross-linking experiments, fluorescence spectroscopy, image reconstruction from electron microscopy, site-directed mutagenesis, and intuition (see Fig. 8.1). The spatial restraints, expressed as probability density functions, are combined into an objective function that is optimized by a combination of conjugate gradients and molecular dynamics with simulated annealing. This model building procedure is similar to structure determination by NMR spectroscopy.
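As a rough illustration of how restraints expressed as probability density functions can be combined into a single objective function, the sketch below treats each restraint as an independent Gaussian on an interatomic distance and sums the negative log-densities. The Gaussian form, the function names, and all numeric values are assumptions made for illustration; they are not MODELLER's actual restraint definitions or code.

```python
import math

def gaussian_neg_log_pdf(x, mean, sigma):
    """Negative log of a Gaussian probability density evaluated at x."""
    return 0.5 * ((x - mean) / sigma) ** 2 + math.log(sigma * math.sqrt(2.0 * math.pi))

def distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def objective(coords, restraints):
    """Sum of negative log-densities over (i, j, mean, sigma) distance restraints."""
    total = 0.0
    for i, j, mean, sigma in restraints:
        total += gaussian_neg_log_pdf(distance(coords[i], coords[j]), mean, sigma)
    return total

# Hypothetical restraints on three atoms: (atom i, atom j, preferred distance, spread).
restraints = [(0, 1, 3.8, 0.1), (1, 2, 3.8, 0.1), (0, 2, 7.6, 0.5)]

satisfied = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
violated = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (7.6, 0.0, 0.0)]

print(objective(satisfied, restraints) < objective(violated, restraints))
```

Coordinates that satisfy the restraints give a lower value than coordinates that violate them, which is the property an optimizer would exploit.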
This chapter uses a sequence with unknown structure to illustrate the use of various modules in MODELLER to perform the four steps of comparative modeling. This is followed by a Notes section that highlights several underlying practical issues.
A computer running Linux/Unix, Apple Mac OS X, or Microsoft Windows 98/NT/2000/XP; 512 MB RAM or higher; about 100 MB of free hard-disk space for the software, example and output files; and a connection to the internet to download the MODELLER program and example files described in this chapter (see Note 1).
The MODELLER 8v2 program, downloaded and installed from http://salilab.org/modeller/download_installation.html. Instructions for the installation are provided as part of the downloaded package; they are also available over the internet at http://salilab.org/modeller/release.html#install.
The files required to follow the example described in this chapter, downloaded and installed from http://salilab.org/modeller/tutorial/MMB06-example.tar.gz (Unix/Linux/MacOSX) or http://salilab.org/modeller/tutorial/MMB06-example.zip (Windows).
2.3. Computer Skills
MODELLER uses Python as its control language. All input scripts to MODELLER are, hence, Python scripts. Although knowledge of Python is not necessary to run MODELLER, it can be useful to perform more advanced tasks.
MODELLER does not have a graphical user interface (GUI) and is run from the command line by executing the input scripts; basic command-line skills are necessary to follow the protocol described in this chapter.
2.4. Conventions Followed in the Text
A sequence with unknown structure, for which a model is being calculated, is referred to as the “target.”
A “template” is an experimentally determined structure, and/or its sequence, used to derive spatial restraints for comparative modeling.
Names of files, objects, modules, and commands to be executed are all shown in monospaced font.
Files with .ali extensions contain the alignment of two or more sequences and/or structures. Files with .pir extensions correspond to a collection of one or more unaligned sequences in the PIR format. Files with .pap extensions contain an alignment in a user-friendly format with an additional line indicating identical aligned residues with a *. All input scripts to MODELLER are Python scripts with the .py extension. Execution of these input scripts always produces a log file identified by the .log extension.
A typical operation in MODELLER would consist of: (1) preparing an input Python script; (2) ensuring that all required files (sequences, structures, alignments, etc.) exist; (3) executing the input script by typing mod8v2 <input-script>; and (4) analyzing the output and log files.
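To make the file conventions above concrete, here is a minimal sketch of reading a .pir file. It assumes the layout commonly used for PIR entries in MODELLER inputs (a header line of the form >P1;code, one colon-separated description line, then the sequence terminated by *); that layout is not spelled out in this chapter, and parse_pir and the demo entry are hypothetical.

```python
def parse_pir(text):
    """Return a list of (code, description, sequence) tuples from PIR-formatted text.

    Assumes every entry has a '>XX;code' header, one description line,
    and a sequence that ends with '*'.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    entries = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1
            continue
        code = lines[i].split(";", 1)[1]
        description = lines[i + 1]
        i += 2
        chunks = []
        while i < len(lines):
            chunk = lines[i]
            i += 1
            if chunk.endswith("*"):
                chunks.append(chunk[:-1])  # drop the terminator
                break
            chunks.append(chunk)
        entries.append((code, description, "".join(chunks)))
    return entries

# Made-up entry, not the real TvLDH sequence.
example = """>P1;demo
sequence:demo:::::::0.00: 0.00
ACDEFGHIKL
MNPQRSTVWY*
"""

entries = parse_pir(example)
for code, description, sequence in entries:
    print(code, len(sequence))
```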
The procedure for calculating a three-dimensional model for a sequence with unknown structure is illustrated using the following example: A novel gene for lactate dehydrogenase (LDH) was identified from the genomic sequence of Trichomonas vaginalis (TvLDH). The corresponding protein had higher sequence similarity to the malate dehydrogenase of the same species (TvMDH) than to any other LDH (13). Comparative models were constructed for TvLDH and TvMDH to study the sequences in a structural context and to suggest site-directed mutagenesis experiments to elucidate changes in enzymatic specificity in this apparent case of convergent evolution. The native and mutated enzymes were subsequently expressed and their activities compared (13).
3.1. Fold Assignment
The first step in comparative modeling is to identify one or more template structure(s) that have detectable similarity to the target. This identification is achieved by scanning the sequence of TvLDH against a library of sequences extracted from known protein structures in the Protein Data Bank (PDB) (14). This step is performed using the profile.build() module of MODELLER (file build_profile.py) (see Note 2). The profile.build() command uses the local dynamic programming algorithm to identify related sequences (6,15). In the simplest case, profile.build() takes as input the target sequence (file TvLDH.pir) and a database of sequences of known structure (file pdb_95.pir) and returns a set of statistically significant alignments (file build_profile.prf). Execute the command by typing mod8v2 build_profile.py.
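For readers unfamiliar with local dynamic programming, the sketch below computes the best local alignment score of two sequences with the textbook Smith-Waterman recurrence. It is only an illustration of the general technique: the match, mismatch, and gap values are arbitrary assumptions, and the scoring used by profile.build() (substitution matrix, gap penalties, significance estimation) is not reproduced here.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score (Smith-Waterman recurrence, linear gap penalty)."""
    previous = [0] * (len(seq_b) + 1)
    best = 0
    for a in seq_a:
        current = [0]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            # Flooring at zero lets an alignment start anywhere, which makes it local.
            cell = max(0, diagonal, up, left)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best

# Toy sequences: the shared fragment ACDE is found despite unrelated flanks.
print(local_alignment_score("WWACDEWW", "ACDE"))
```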
The results of the scan are stored in the output file called build_profile.prf. The first six lines of this file contain the input parameters used to create the alignments. Subsequent lines contain several columns of data. For the purposes of this example, the most important columns are: (1) the second column, containing the PDB code of the related template sequences; (2) the 11th column, containing the percentage sequence identity between the TvLDH and template sequences; and (3) the 12th column, containing the E-values for the statistical significance of the alignments.
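The three columns singled out above can be pulled from the file with a few lines of Python. The sketch below assumes whitespace-separated columns and that header lines start with "#"; both are assumptions about the file layout, and parse_profile_hits is a hypothetical helper. The sample rows are synthetic: only the 45% identity and zero E-value for 1y7tA come from the text, and hypoA and hypoB are invented codes.

```python
def parse_profile_hits(lines, max_evalue=0.01):
    """Extract (code, percent identity, E-value) from profile rows.

    Uses column 2 (code), column 11 (identity), and column 12 (E-value),
    counting columns from 1. Rows with an E-value above max_evalue are dropped.
    """
    hits = []
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # assumed header or blank line
        fields = line.split()
        if len(fields) < 12:
            continue
        code = fields[1]
        identity = float(fields[10])
        evalue = float(fields[11])
        if evalue <= max_evalue:
            hits.append((code, identity, evalue))
    # Most significant first; ties broken by higher sequence identity.
    hits.sort(key=lambda h: (h[2], -h[1]))
    return hits

sample = [
    "# synthetic header line",
    "3 hypoA X 1 330 3 320 4 322 310 38. 0.0",
    "2 1y7tA X 1 327 5 316 7 325 316 45. 0.0",
    "4 hypoB X 1 300 10 290 12 295 280 30. 0.5",
]

hits = parse_profile_hits(sample)
for code, identity, evalue in hits:
    print(code, identity, evalue)
```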
The extent of similarity between the target-template pairs is usually quantified using sequence identity or a statistical measure such as E-value (see Notes 3 and 4). Inspection of column 11 shows that the template with the highest sequence identity with the target is the 1y7tA structure (45% sequence identity). Further inspection of column 12 shows that there are six PDB sequences, all corresponding to malate dehydrogenases (1y7tA, 5mdhA, 1b8pA, 1civA, 7mdhA, and 1smkA) that show significant similarities to TvLDH with E-values of zero. Two variations of the model building procedure will be described below, one using a single template with the highest sequence identity (1y7tA; 3.2.1), and another using all six templates (3.2.2), to highlight their differences (see Note 5).
3.2. Sequence-Structure Alignment
Sequence-structure alignments are calculated using the align2d() module of MODELLER (see Note 6). Although align2d() is based on a global dynamic programming algorithm (16), it is different from standard sequence-sequence alignment methods because it takes into account structural information from the template when constructing an alignment. This task is achieved through a variable gap penalty function that tends to place gaps in solvent exposed and curved regions, outside secondary structure segments, and between two positions that are close in space (9). In the current example, the target-template similarity is so high that almost any method with reasonable parameters results in the correct alignment (see Note 7).
The input script align2d-single.py reads in the structure of the chosen template (1y7tA) and the target sequence (TvLDH) and calls the align2d() module to perform the alignment. The resulting alignment is written out to the specified alignment files (TvLDH-1y7tA.ali in the PIR format and TvLDH-1y7tA.pap in the PAP format).
The first step in using multiple templates for modeling is to obtain a multiple structure alignment of all the chosen templates. The structure alignment module of MODELLER, salign(), can be used for this purpose (17). The input script salign.py contains the necessary Python instructions to achieve a multiple structure alignment. The script reads in all the six template structures into an alignment object and then calls salign() to generate the multiple structure alignment. The output alignment is written out to TvLDH-salign.ali and TvLDH-salign.pap, in the PIR and PAP formats, respectively.
The next step is to align the TvLDH sequence with the multiple structure alignment generated in the preceding step. This task is accomplished using the script file align2d-multiple.py, that again calls the align2d() module to calculate the sequence-structure alignment. Upon execution, the resulting alignments are written to TvLDH-multiple.ali and TvLDH-multiple.pap in the PIR and PAP formats, respectively.
3.3. Model Building
Two variations of the model building protocol are described, corresponding to the two alignments generated in the preceding section: (1) modeling using a single template; and (2) modeling using multiple templates, followed by building and optimizing a consensus model. The files required for each of these protocols are present in separate subdirectories called single/ and multiple/, respectively.
3.3.1. Single Template
The input script model-single.py lists the Python commands necessary to build the model of the TvLDH sequence using the information derived from the 1y7tA structure. The script calls the automodel class, specifying the name of the alignment file to use and the identifiers of the target (TvLDH) and template (1y7tA) sequences. The starting_model and ending_model parameters specify that 10 models should be calculated using different randomizations of the initial coordinates. The models are then assessed with the GA341 (18,19) and DOPE (12) assessment functions.
Upon completion, the 10 models for TvLDH are written out in the PDB format to files called TvLDH.B9990[0001–0010].pdb (see Notes 8 and 9).
3.3.2. Multiple Templates with Consensus Modeling
The input script, model-multiple.py, is quite similar to model-single.py. The specification of the template codes to automodel now contains all six chosen PDB codes and, additionally, the cluster() method is called to exploit the diversity of the 10 generated models via a clustering and optimization procedure to construct a single consensus model (see Note 10).
Upon completion, the 10 models for TvLDH and the consensus model are written out to TvLDH.B9990[0001–0010].pdb and cluster.opt, respectively.
3.4. Model Evaluation
The log files produced by each of the model building procedures (model-single.log and model-multiple.log) contain a summary of each calculation at the bottom of the file. This summary includes, for each of the 10 models, the MODELLER objective function (see Note 11) (5), the DOPE pseudo-energy value (see Note 12), and the value of the GA341 score (see Notes 13 and 14). These scores can be used to identify which of the 10 models produced is likely to be the most accurate model (see Note 15).
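One way to turn the three scores into a choice is sketched below: discard models whose GA341 score falls under the 0.7 cutoff discussed in the Notes, then take the lowest DOPE score among the rest. The helper pick_best_model, the dictionary layout, and every score value are hypothetical; the numbers are not taken from an actual log file.

```python
def pick_best_model(models, min_ga341=0.7):
    """Return the lowest-DOPE model among those with GA341 >= min_ga341, or None."""
    reliable = [m for m in models if m["ga341"] >= min_ga341]
    if not reliable:
        return None  # no model passes the fold-assessment cutoff
    return min(reliable, key=lambda m: m["dope"])

# Invented scores for three models.
models = [
    {"name": "model_1", "molpdf": 1800.0, "dope": -38000.0, "ga341": 1.0},
    {"name": "model_2", "molpdf": 1750.0, "dope": -38500.0, "ga341": 1.0},
    {"name": "model_3", "molpdf": 1700.0, "dope": -39000.0, "ga341": 0.4},
]

best = pick_best_model(models)
print(best["name"])
```

In this made-up case model_3 has the lowest objective function and DOPE values but fails the GA341 cutoff, so model_2 is selected.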
A residue-based pseudo-energy profile for the best scoring model, chosen as the one with the lowest DOPE statistical potential score, can be obtained by executing the evaluate_model.py script. This script is available in each of the subdirectories mentioned above. Such a profile is useful to detect local regions of high pseudo-energy that usually correspond to errors in the model (see Notes 16 and 17). Figure 8.2 shows the pseudo-energy profiles of the best scoring models from each procedure. It can be seen that some of the errors in the single-template model have been resolved in the model calculated using multiple templates.
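Once a per-residue profile is available as a list of numbers, regions of high pseudo-energy can be located with a small amount of post-processing. The sketch below smooths the profile with a sliding-window average and reports residue ranges above a threshold. The window size, the threshold, and the profile values are all assumptions chosen for illustration; this is not a description of what evaluate_model.py does internally.

```python
def smooth_profile(values, window=15):
    """Sliding-window average centered on each residue (window truncated at the ends)."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - half)
        end = min(len(values), i + half + 1)
        smoothed.append(sum(values[start:end]) / (end - start))
    return smoothed

def high_energy_segments(values, threshold):
    """Return (first, last) 1-based residue ranges whose value exceeds the threshold."""
    segments = []
    start = None
    for i, value in enumerate(values, start=1):
        if value > threshold and start is None:
            start = i
        elif value <= threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(values)))
    return segments

# Invented profile: residues 11 to 15 are less favorable than their neighbors.
profile = [-0.05] * 10 + [-0.01] * 5 + [-0.05] * 10

segments = high_energy_segments(smooth_profile(profile, window=3), threshold=-0.03)
print(segments)
```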
- 1. Exactly the same job run on two different types of computers (e.g., Windows/Intel and a Macintosh) generally returns slightly different results. The reason for this variation is the difference in the rounding of floating point numbers, which may lead to a divergence between optimization trajectories starting at exactly the same initial conditions. Although these differences are generally small, for absolute reproducibility, the same type of computer architecture and operating system need to be used.
- 2. As mentioned, knowledge of the Python scripting language is not a requirement for basic use of MODELLER. The lines in the script file are usually self-explanatory and input/output options for each module are described in the manual. For the purpose of illustration, the various lines of the build_profile.py script are described below (Fig. 8.3):
• log.verbose() sets the amount of information that is written out to the log file.
• environ() initializes the “environment” for the current modeling run, by creating a new environ object, called env. Almost all MODELLER scripts require this step, as the new object is needed to build most other useful objects.
• sequence_db() creates a sequence database object, calling it sdb, which is used to contain large databases of protein sequences.
• sdb.read() reads a file, in text format, containing non-redundant PDB sequences into the sdb database. The input options to this command specify the name of the file (seq_database_file="pdb_95.pir"), the format of the file (seq_database_format="pir"), whether to read all sequences from the file (chains_list="all"), upper and lower bounds for the lengths of the sequences to be read (minmax_db_seq_len=(30,3000)), and whether to clean the sequences of non-standard residue names (clean_sequences=True).
• sdb.write() writes a binary machine-independent file (seq_database_format="binary") with the specified name (seq_database_file="pdb_95.bin"), containing all sequences read in the previous step.
• The second call to sdb.read() reads the binary format file back in for faster execution.
• alignment() creates a new “alignment” object (aln).
• aln.append() reads the target sequence TvLDH from the file TvLDH.ali and aln.to_profile() converts it to a profile object (prf). Profiles contain information similar to that in alignments, but are more compact and better suited for sequence database searching.
• prf.build() searches the sequence database (sdb) using the target profile stored in the prf object as the query. Several options, such as the parameters for the alignment algorithm (matrix_offset, rr_file, gap_penalties, etc.), are specified to override the default settings. max_aln_evalue specifies the threshold value to use when reporting statistically significant alignments.
• prf.write() writes a new profile containing the target sequence and its homologs into the specified output file (file="build_profile.prf").
• The profile is converted back to the standard alignment format and written out using aln.write().
- 3. Sequence-structure relationships can be divided into three different regimes of the sequence similarity spectrum: (1) the easily detected relationships characterized by >30% sequence identity; (2) the “twilight zone” (20) corresponding to relationships with statistically significant sequence similarity, with identities generally in the 10–30% range; and (3) the “midnight zone” (20) corresponding to statistically insignificant sequence similarity. Hence, the sequence identity is a good predictor of the accuracy of the final model when its value is greater than 30%. It has been shown that models based on such alignments usually have, on average, more than approximately 60% of the backbone atoms correctly modeled with a root-mean-squared-deviation (RMSD) of less than 3.5 Å (Fig. 8.4).
However, the sequence identity is not a statistically reliable measure of alignment significance and corresponding model accuracy for values lower than 30% (20,21). During a scan of a large database, for instance, it is possible that low values occur purely by chance. In such cases, it is useful to quantify the sequence-structure relationship using more robust measures of statistical significance, such as E-values, that compare the score obtained for an alignment with an established background distribution of such scores.
One other problem of using sequence identity as a measure to select templates is that, in practice, there is no single generally accepted way to normalize it (21). For instance, local alignment methods usually normalize the number of identically aligned residues by the length of the alignment, whereas global alignment methods normalize it by either the length of the target sequence or the length of the shorter of the two sequences. Therefore, it is possible that alignments of short fragments produce a high sequence identity but do not result in an accurate model. Measures of statistical significance do not suffer from this normalization problem because the alignment scores are always corrected for the length of the aligned segment before the significance is computed (22,23).
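The normalization problem is easy to see numerically. The sketch below computes percentage identity for one pairwise alignment under three denominators: the number of aligned (ungapped) columns, the target length, and the length of the shorter sequence. The function name and the toy sequences are hypothetical, and treating "length of the alignment" as the count of ungapped columns is an assumption.

```python
def sequence_identities(aligned_target, aligned_template, gap="-"):
    """Percent identity of one pairwise alignment under three normalizations."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    identical = 0
    aligned = 0
    for a, b in zip(aligned_target, aligned_template):
        if a == gap or b == gap:
            continue
        aligned += 1
        if a == b:
            identical += 1
    if aligned == 0:
        raise ValueError("no aligned positions")
    target_len = len(aligned_target.replace(gap, ""))
    template_len = len(aligned_template.replace(gap, ""))
    return {
        "by_aligned_positions": 100.0 * identical / aligned,
        "by_target_length": 100.0 * identical / target_len,
        "by_shorter_sequence": 100.0 * identical / min(target_len, template_len),
    }

# A 5-residue fragment aligned to a 20-residue target (toy sequences).
result = sequence_identities("ACDEFGHIKLMNPQRSTVWY", "-----GHIKL----------")
print(result)
```

The short fragment scores 100% identity when normalized by aligned positions but only 25% when normalized by target length, which is why a high identity over a short fragment does not by itself imply an accurate model.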
After a list of all related protein structures and their alignments with the target sequence has been obtained, template structures are usually prioritized depending on the purpose of the comparative model. Template structures may be chosen based purely on the target-template sequence identity or a combination of several other criteria, such as the experimental accuracy of the structures (resolution of x-ray structures, number of restraints per residue for NMR structures), conservation of active site residues, holo-structures that have bound ligands of interest, and prior biological information that pertains to the solvent, pH, and quaternary contacts.
Although fold assignment and sequence-structure alignment are logically two distinct steps in the process of comparative modeling, in practice almost all fold assignment methods also provide sequence-structure alignments. In the past, fold assignment methods were optimized for better sensitivity in detecting remotely related homologues, often at the cost of alignment accuracy. However, recent methods simultaneously optimize both the sensitivity and alignment accuracy. For the sake of clarity, however, they are still considered as separate steps in the current chapter.
Most alignment methods use local or global dynamic programming algorithms to derive the optimal alignment between two or more sequences and/or structures. The methods, however, vary in terms of the scoring function that is being optimized. The differences are usually in the form of the gap-penalty function (linear, affine, or variable) (9), the substitution matrix used to score the aligned residues (20 × 20 matrices derived from alignments with a given sequence identity, those derived from structural alignments, those incorporating the structural environment of the residues) (24), or combinations of both (25, 26, 27, 28). There does not yet exist a single universal scoring function that guarantees the most accurate alignment for all situations. Above 30–40% sequence identity, alignments produced by almost all of the methods are similar. However, in the twilight and midnight zones of sequence identity, models based on the alignments of different methods tend to have significant variations in accuracy. Improving the performance and accuracy of methods in this regime remains one of the main tasks of comparative modeling (29,30).
The single source of errors with the largest impact on comparative modeling is misalignments, especially when the target-template sequence identity decreases below 30%. It is imperative to calculate an accurate alignment between the target-template pair, as comparative modeling can almost never recover from an alignment error (31).
Comparative models do not reflect the fluctuations of a protein structure in solution. That is, the variability seen in the structures of multiple models built for one set of inputs reflects different solutions to the molecular objective function, which do not correspond to the actual dynamics of the protein structure in nature.
If there are no large differences among the template structures (>2 Å backbone RMSD) and no long insertions or deletions (>5 residues) between the target and the template(s), building multiple models generally does not drastically improve the accuracy of the best model produced. For alignments to similar templates that lack many gapped regions, building multiple models from the same input alignment most often results in a narrow distribution of accuracies: the Cα RMSD values between each model and the true native structure usually fall within a range of 0.5 Å for a sequence containing approximately 150 residues (5). If, however, the sequence-structure alignment contains different templates with many insertions and/or deletions, it is important to calculate multiple models for the same alignment. Calculating multiple models allows for better sampling of the different template segments and the conformations of the unaligned regions, and often results in a more accurate model than if only one model had been produced.
A consensus model is calculated by first clustering an ensemble of models and then averaging individual atomic positions. The consensus model is then optimized using the same protocol used on the individual models. Construction of a consensus model followed by optimization usually results in a model with a lower objective function than any of the contributing models; the construction of a consensus model can thus be seen as a part of an efficient optimization. When there are substantial variations in regions of the contributing models, due to the variation among the templates and the presence of gaps in the alignment, calculating the consensus using cluster averaging usually produces the most accurate conformation.
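The averaging step can be sketched in a few lines. The example assumes the models in a cluster are already superposed and list their atoms in the same order; the clustering itself and the subsequent optimization are not shown, and average_coordinates is a hypothetical helper rather than the cluster() method.

```python
def average_coordinates(models):
    """Average each atom's (x, y, z) position over models with identical atom order."""
    if not models:
        raise ValueError("need at least one model")
    n_atoms = len(models[0])
    if any(len(model) != n_atoms for model in models):
        raise ValueError("all models must contain the same number of atoms")
    n_models = len(models)
    consensus = []
    for atom in range(n_atoms):
        consensus.append(
            tuple(sum(model[atom][axis] for model in models) / n_models for axis in range(3))
        )
    return consensus

# Two toy two-atom models (coordinates in Å).
cluster = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (1.0, 2.0, 0.0)],
]

consensus = average_coordinates(cluster)
print(consensus)
```

Plain averaging can distort bond lengths and angles where the models disagree, which is consistent with the averaged structure being optimized afterwards.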
The MODELLER objective function is a measure of how well the model satisfies the input spatial restraints. Lower values of the objective function indicate a better fit with the input data and, thus, models that are likely to be more accurate (5).
The Discrete Optimized Protein Energy (DOPE) (12) is an atomic distance-dependent statistical potential based on a physical reference state that accounts for the finite size and spherical shape of proteins. The reference state assumes a protein chain consists of non-interacting atoms in a homogeneous sphere of equivalent radius to that of the corresponding protein. The DOPE potential was derived by comparing the distance statistics from a non-redundant PDB subset of 1,472 high-resolution protein structures with the distance distribution function of the reference state. By default, the DOPE score is not included in the model building routine, and thus can be used as an independent assessment of the accuracy of the output models. The DOPE score assigns a score for a model by considering the positions of all non-hydrogen atoms, with lower scores corresponding to models that are predicted to be more accurate.
The GA341 criterion is a composite fold-assessment score that combines a Z-score calculated with a statistical potential function, target-template sequence identity, and a measure of structural compactness (18,19). The score ranges from 0.0 for models that tend to have an incorrect fold to 1.0 for models that tend to be comparable to low-resolution x-ray structures. Comparison of models with their corresponding experimental structures indicates that models with GA341 scores greater than 0.7 generally have the correct fold with more than 35% of the backbone atoms superposable within 3.5 Å of their native positions. Reliable models (GA341 score ≥ 0.7) based on alignments with more than 40% sequence identity have a median overlap of more than 90% with the corresponding experimental structure. In the 30–40% sequence identity range, the overlap is usually between 75 and 90%, and below 30% it drops to 50–75%, or even less in the worst cases.
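The overlap and RMSD figures quoted here can be computed from two coordinate sets as sketched below. The code assumes the model has already been superposed on the experimental structure and that equivalent atoms appear in the same order; the superposition step is not shown, and the function names and coordinates are hypothetical.

```python
import math

def atom_distance(a, b):
    """Euclidean distance between two (x, y, z) points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def rmsd(model, native):
    """Root-mean-squared deviation over equivalent, pre-superposed atoms."""
    squared = [atom_distance(a, b) ** 2 for a, b in zip(model, native)]
    return math.sqrt(sum(squared) / len(squared))

def fraction_within(model, native, cutoff=3.5):
    """Fraction of equivalent atoms lying within cutoff Å of their native positions."""
    close = sum(1 for a, b in zip(model, native) if atom_distance(a, b) <= cutoff)
    return close / len(model)

# Toy coordinates: three atoms coincide, one is displaced by 4 Å.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
native = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 4.0, 0.0)]

print(rmsd(model, native), fraction_within(model, native))
```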
The accuracy of a model should first be assessed using the GA341 score to increase or decrease our confidence in the fold of the model. An assessment of an incorrect fold implies that an incorrect template(s) was chosen or an incorrect alignment with the correct template was used for model calculation. When the target-template relationship falls in the twilight or midnight zones, it is usually difficult to differentiate between these two kinds of errors. In such cases, building models based on different sets of templates may resolve the problem.
Different measures to predict errors in a protein structure perform best at different levels of resolution. For instance, physics-based forcefields may be helpful at identifying the best model when all models are very close to the native state (<1.5 Å RMSD, corresponding to approximately 85% target-template sequence identity). In contrast, coarse-grained scores such as distance-based statistical potentials have been shown to have the greatest ability to differentiate between models in the approximately 3 Å Cα RMSD range. Tests show that such scores are often able to identify a model within 0.5 Å Cα RMSD of the most accurate model produced (32). When multiple models are built, the DOPE score generally selects a more accurate model than the MODELLER objective function.
Segments of the target sequence that have no equivalent region in the template structure (i.e., insertions or loops) are among the most difficult regions to model (4,33,34,35). This difficulty is compounded when the target and template are distantly related, with errors in the alignment leading to incorrect positions of the insertions and distortions in the loop environment. Using alignment methods that incorporate structural information can often correct such errors (9). Once a reliable alignment is obtained, various modeling protocols can predict the loop conformation for insertions shorter than 8–10 residues (4,33,36,37).
As a consequence of sequence divergence, the mainchain conformation of a protein can change, even if the overall fold remains the same. Therefore, it is possible that in some correctly aligned segments of a model, the template is locally different (<3 Å) from the target, resulting in errors in that region. The structural differences are sometimes not due to differences in sequence, but are a consequence of artifacts in structure determination or structure determination in different environments (e.g., packing of subunits in a crystal). The simultaneous use of several templates can minimize this kind of an error (38,39).
- 2. Pieper, U., Eswar, N., Davis, F. P., Braberg, H., Madhusudhan, M. S., Rossi, A., Marti-Renom, M., Karchin, R., Webb, B. M., Eramian, D., Shen, M. Y., Kelly, L., Melo, F., and Sali, A. (2006) MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res. 34, D291–295.
- 6. Eswar, N., Madhusudhan, M. S., Marti-Renom, M. A., and Sali, A. (2005) BUILD_PROFILE: a module for calculating sequence profiles in MODELLER. http://www.salilab.org/modeller
- 8. Eswar, N., Madhusudhan, M. S., Marti-Renom, M. A., and Sali, A. (2005) PROFILE_SCAN: a module for fold-assignment using profile-profile scanning in MODELLER. http://www.salilab.org/modeller
- 10. MacKerell, A. D., Jr., Bashford, D., Bellott, M., Dunbrack, R. L., Jr., Evanseck, J. D., Field, M. J., Fischer, S., Gao, J., Guo, H., Ha, S., Joseph-McCarthy, D., Kuchnir, L., Kuczera, K., Lau, F. T. K., Mattos, C., Michnick, S., Nguyen, D. T., Ngo, T., Prodhom, B., Reiher, W. E., III, Roux, B., Schlenkrich, M., Smith, J. C., Stote, R., Straub, J., Watanabe, M., Wiorkiewicz-Kuczera, J., Yin, D., and Karplus, M. (1998) All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102, 3586–3616.
- 14. Deshpande, N., Addess, K. J., Bluhm, W. F., Merino-Ott, J. C., Townsend-Merino, W., Zhang, Q., Knezevich, C., Xie, L., Chen, L., Feng, Z., Green, R. K., Flippen-Anderson, J. L., Westbrook, J., Berman, H. M., and Bourne, P. E. (2005) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res. 33, D233–237.
- 17. Madhusudhan, M. S., Eswar, N., Marti-Renom, M. A., and Sali, A. (2005) SALIGN: a module for multiple sequence/structure alignments in MODELLER. http://www.salilab.org/modeller