Flexible mapping of homology onto structure with Homolmapper

Rockwell, Nathan C; Lagarias, J Clark

doi:10.1186/1471-2105-8-123

Flexible mapping of homology onto structure with Homolmapper

Software
Open access
Published: 11 April 2007

Volume 8, article number 123, (2007)
Cite this article

Download PDF

You have full access to this open access article

BMC Bioinformatics Aims and scope Submit manuscript

Flexible mapping of homology onto structure with Homolmapper

Download PDF

Nathan C Rockwell¹ &
J Clark Lagarias¹

5799 Accesses
19 Citations
Explore all metrics

Abstract

Background

Over the past decade, a number of tools have emerged for the examination of homology relationships among protein sequences in a structural context. Most recent software implementations for such analysis are tied to specific molecular viewing programs, which can be problematic for collaborations involving multiple viewing environments. Incorporation into larger packages also adds complications for users interested in adding their own scoring schemes or in analyzing proteins incorporating unusual amino acid residues such as selenocysteine.

Results

We describe homolmapper, a command-line application for mapping information from a multiple protein sequence alignment onto a protein structure for analysis in the viewing software of the user's choice. Homolmapper is small (under 250 K for the application itself) and is written in Python to ensure portability. It is released for non-commercial use under a modified University of California BSD license. Homolmapper permits facile import of additional scoring schemes and can incorporate arbitrary additional amino acids to allow handling of residues such as selenocysteine or pyrrolysine. Homolmapper also provides tools for defining and analyzing subfamilies relative to a larger alignment, for mutual information analysis, and for rapidly visualizing the locations of mutations and multi-residue motifs.

Conclusion

Homolmapper is a useful tool for analysis of homology relationships among proteins in a structural context. There is also extensive, example-driven documentation available. More information about homolmapper is available at http://www.mcb.ucdavis.edu/faculty-labs/lagarias/homolmapper_home/homolmapper%20web%20page.htm.

Comparative Protein Structure Analysis with Bio3D-Web

CoeViz: a web-based tool for coevolution analysis of protein residues

Article Open access 08 March 2016

PROMALS3D: Multiple Protein Sequence Alignment Enhanced with Evolutionary and Three-Dimensional Structural Information

Background

The proliferation of known or suspected protein sequences in the post-genomic era and the slower, but steady, progress in protein structure determination implies that there are a large number of proteins with no experimentally determined structure but with varying homology to a protein of known structure. When such proteins are of experimental or practical interest, several approaches can be used to generate some form of approximate structural information to aid in the design or interpretation of biochemical experiments. For example, homology modeling can be used to build a model structure for the protein of interest based on the known structure. First introduced in 1969 [1], homology modeling has become a more commonly applied tool in recent years. However, homology modeling remains sensitive not only to the multiple sequence alignment (MSA) of homologous proteins, but also to the particular algorithms used in the structural modeling [2]. Moreover, the confidence assigned to such models is dependent on the evolutionary relatedness of the modeling target to the known template.

An alternative approach is to map information from the MSA onto the known structure, typically via color-coding or similar visual procedures. Such combinations of structural and sequence information do not contain as much information as carefully made homology models, but they permit many of the same applications as homology modeling, such as predicting the location and/or function of potentially interesting residues, and are also useful in evaluating and modifying homology models themselves: a homology model that places a site of frequent gaps and insertions in the middle of a central beta strand rather than on a surface loop is unlikely to be reliable. Scoring the MSA onto the known structure is driven only by the MSA, without potential bias from the modeling or threading programs. This approach was formalized in 1995 with PDBALIGN [3], a command-line application taking as inputs a protein structure in PDB format and an MSA in GCG format. PDBALIGN generates a new PDB file retaining the original sequence and spatial information, but with homology information written to the B factor (or temperature factor) field of the output PDB file. Scoring schemes cannot be added to PDBALIGN without changing the source code and recompiling, and the program is similarly limited in its ability to handle residues other than the canonical 20 amino acids without manual editing of the PDB file.

In the years since the appearance of PDBALIGN, a number of other programs have been developed that permit such evaluation of sequence information in various contexts, including SwissModel [4], Chimera [5], Protein Explorer [6], ConSurf [7–10], STING [11–14], and several commercial packages. Most recently, MultiSeq has been reported [15] as an extension to the popular VMD structural analysis software [16] with an emphasis on evolutionary relationships. The underlying trend in more recent software is therefore to tie presentation of information from the MSA to a single viewing environment, such as Chimera or VMD. Such software is thus not well suited to collaborations among workers who prefer different environments and may not be well-suited to automated operation because of the emphasis on integration with a viewing and analysis package or web server largely driven by a graphical user interface (GUI). Therefore, we believe there is a niche for a flexible, up-to-date, open-source command-line tool for mapping the homology information from protein sequence alignments onto protein structures.

We have developed homolmapper as a command-line Python application for such work. Homolmapper places special emphasis on visualizing additional information such as the location of mutations, on defining and characterizing a subfamily within a given alignment, and on allowing runtime extensibility for user-defined scoring schemes and for handling of noncanonical residues in the structure and/or alignment. Homolmapper currently uses PDB files and several different MSA formats as inputs and outputs a new PDB file with homology information written to fields that do not change the spatial information. This permits the user to view the results in the viewing environment of their own choice. It is also possible to use a position-specific scoring matrix (PSSM) generated by PSI-BLAST [17] instead of an MSA. Homolmapper is small (< 250 K), portable, and requires only a working Python installation of version 2.3 or later [18].

Implementation and results

Rationale for program operation

Within a given PDB file, individual atoms are assigned a single ATOM or HETATM record identifying the atom and locating it within a 3-dimensional Cartesian coordinate system (see [19] for a full description of the PDB format). Within such records, individual fields such as residue number, atom name, and z coordinate are defined by column offset. The PDB format also provides several additional fields that provide either additional experimental detail about the structure determination (occupancy and B factor, the latter variously known as B field, B factor, and temperature factor) or provide additional descriptive information (SegID, element, and charge).

B factor has been previously used as a convenient field for storing information about homology [3]. Like B field, occupancy is a 6-column floating-point column well suited to storing such information; the other fields dispensable for unique specification of the atom name and location are smaller (SegID is 4 columns, while element and charge are each only 2 columns) and are less frequently supported by viewing programs. Homolmapper writes information to occupancy and/or B factor; additional outputs are written to the SegID field (introduced in 1996 and still in use by applications such as MODELLER, [20]), albeit with less precision. The element field can also be used. The primary output is a new PDB file to permit the user to evaluate the results in the viewing program of choice. Of course, this imposes fundamental limitations on the nature of the outputs, because they must fit the PDB standard. Homolmapper results also must be archived as complete structure files rather than simple parameter lists, although the extra storage requirements are modest and such files are by their nature stand-alone and not linked to a single "master" structure file. Homolmapper can also report scores as a tab-delimited text file for analysis with other software tools.

Overall architecture

Homolmapper is written in Python using a functional approach and has an overall layout resembling a C program [see Additional files 1 &2]. Extensive commenting is provided within the main script for users wishing to alter the program itself, but such editing is not needed for normal operations nor for many types of user customization. Homolmapper is a strictly command-line application; user control is provided via a large number of flags that can be added to the command line to control various aspects of program behavior [see Additional file 3].

Homolmapper can also use specialized text files to import information on a number of options, such as subfamilies, mutations, non-standard residues, and additional scoring schemes. It is possible to supply some or all of the desired settings, including input files, in a text file for repetitive tasks or to supply preferred settings. This feature also aids in automation of homolmapper for batch jobs. Homolmapper can also generate such settings files, as well as generating files containing the subfamily definition, protein sequences from the input PDB file, or runtime alignment (in CLUSTAL format, [21, 22]). Homolmapper output files have headers indicating their source; most provide basic information about the run parameters used in preparing the file.

Matching the structure to the alignment

To produce meaningful output, software such as homolmapper must synchronize the protein structure and the MSA by matching the sequence contained in the protein structure to a sequence in the MSA. In homolmapper, this matching process is by default automatic. Homolmapper will first check for exact matches to the entire structure sequence (step one) and then, assuming no matches were found, for exact matches to each chain in the PDB file (step two). If no significant matches are found at this point, the sequence in each chain will be broken into short blocks, and exact matches to those blocks will be used to match the sequence (step three). Should this fail, a slower recursive algorithm will be employed (step four). If no significant matches can be found, an error will be triggered and execution will cease.

The chosen matched sequence is used to determine which portions of the structure are exact matches to the MSA; only these residues will be scored, so that the presence of exogenous sequence such as a His tag will not result in spurious results. By default, homolmapper will pass all unscored atoms to the output file without changes, but it is possible to set the output fields of unscored atoms to zero (occupancy and B factor) or to empty fields (SegID and element) if so requested.

The user has control over several aspects of this process. It is possible to choose the first matching method used, so that methods can be skipped if so chosen. It is also possible to redefine the threshold of significance for the entire matched sequence, to specify the size of the peptide blocks used at the third step, and to specify the smallest significant fragments used by the recursive algorithm. Additionally, the user can supply the name of one of the sequences in the MSA. If such a name is supplied and matches a name in the MSA, homolmapper will only test that name as a significant match. If an invalid name is supplied, homolmapper will warn the user and use the default method.

Scoring schemes and reference sequences

Homolmapper can score sequences by gaps, insertions, identity, Shannon's entropy [23], information content, and several similarity schemes. Shannon's entropy and information content can also use user-defined amino acid sets, which readily permits correction for chemical similarity among residues in entropy scoring [24]. These schemes are by default reported in bits (i.e. taken as log base 2), but the user can choose an alternative base for the logarithm.

Rather than trying to use a single objective function for scoring the MSA including gaps [25, 26], homolmapper uses separate analysis of gaps and insertion. This reflects its intended uses in analyzing homology models and in structure/function studies, because gaps and insertions are useful in evaluating the overall correctness of fold in homology modeling with distant templates (Figure 1). The elongation of surface loops has also been implicated in cold adaptation of enzymes from psychrophilic organisms [27–29]. Gaps are here defined as positions where the matched sequence lacks amino acids but other proteins in the alignment have them, and insertions are defined as positions where the matched sequence has amino acids but at least one other protein lacks them. Gaps and insertions are therefore always scored relative to the matched sequence. Gaps and insertions can be scored by length or by frequency (Figure 2A).

Homolmapper offers several approaches for calculating similarity of amino acids. Identity can be calculated by a simple arithmetic procedure. This procedure compares each residue in the alignment to the aligned residue in a reference sequence, and the number of identical residues will be divided by the total number of sequences in the alignment. Alternately, the number of identical residues can be divided by the number of sequences that have a residue at the position in question to avoid penalizing for gaps, which can be useful for locating rare but conserved insertions.

Several types of similarity can be calculated via a matrix lookup procedure which makes it easy to add scoring schemes by importing them as text files at runtime. BLOSUM62 [30], PAM250 [31], and charge matrices are incorporated into the homolmapper script itself. A total of 35 other scoring schemes are supplied as formatted text files as part of the homolmapper distribution, including series of BLOSUM [30], PAM [31], and Gonnet [32] substitution matrices and additional scoring matrices based on physical properties such as hydrophobicity, sidechain entropy of folding, and nonpolar accessible surface area (Table 1). Any or all of these schemes can be imported at runtime as needed, although homolmapper will use at most two such schemes because of the limited number of output fields available (occupancy and B factor). Homolmapper can report both the score (to occupancy) and the variation (to B factor) at each position for any matrix scheme, permitting assessment of both conservation and variability (Figure 3). Variation can be reported as standard deviation or standard error of the scores at each position in the MSA, or as the range of scores spanned at each position. Homolmapper also reports mean, standard deviation, and score diversity for the scores themselves in the header of the output PDB file. Score diversity is here calculated as the percentage of the maximum Shannon entropy for the score in question, and is therefore akin to the DOPS measure used in assessing alignment scoring in ScoreCONS [26]. Scoring schemes can also be reported as Z-scores relative to the calculated mean and standard deviation.

Table 1 Similarity scoring schemes distributed with homolmapper.

Full size table

In addition to identity and the matrix-lookup schemes, homolmapper can score using user-defined amino acid sets. Such sets treat a group of amino acids as equivalent for scoring. Homolmapper can report the percentage of sequences having a member of such a set at each position in the structure ("degen" scoring, Figure 2B), or it can define all amino acids as belonging to sets and then report the percentage of sequences having members of the same set as a reference sequence for each position ("sloppy" scoring, Figure 2B). This feature gives the user the ability to quickly look for features such as hydrogen bond donors at an unusual pH without taking the time to create a suitable scoring matrix. It is possible to disable gap penalties for scoring by matrix-lookup and by user-defined amino acid sets.

Several of these scoring schemes are relative. For example, calculating percent identity implies the existence of a reference sequence to which the others will be compared. By default, homolmapper uses the matched sequence (effectively, the structure sequence) as the reference sequence; as discussed above, this is always the case for scoring gaps or insertions. However, schemes such as identity or similarity matrices can instead use a consensus sequence or a different sequence found in the alignment as the reference sequence. Consensus sequences can be supplied as part of the input MSA, permitting import of consensus sequences from sources such as HSSP or STING [9, 12, 33, 34], or they can be generated by homolmapper at runtime as needed. A summary of the available reference sequence choices for different scoring schemes is presented in Table 2.

Table 2 Homolmapper reference sequences and scoring schemes

Full size table

Homolmapper can also use the SegID field of the output PDB file to report a number of additional features. The SegID field can be used to display a consensus sequence, to show which residue in the structure is assigned to which user-defined amino acid set (Figure 2B), or to show the amino acid numbers in the matched sequence corresponding to each residue in the structure. The SegID field is also used to report information about mutations, highlights, and multi-residue motifs.

Implementation of mutations, residue highlights, and multi-residue motifs

Homolmapper can process information about mutations, reporting the results to the SegID field of the output PDB file. Mutations can be described via the command line or via an auxiliary text file supplied as an extra input. Mutations can simply be labeled on the structure, so that their probable location can be visualized to aid in interpretation, or they can be scored with a variety of algorithms. For example, mutations can be scored by conservation of the wild-type residue found in the protein in which the mutation was described, by conservation of the mutant amino acid for substitutions, or by the percentage of sequences lacking amino acids after a given position for comparison to a C-terminal truncation. Homolmapper can also score different types of mutations by different algorithms (Table 3) to provide a crude estimate of how well various mutations might be tolerated.

Table 3 Scoring algorithms for assessing mutations by type

Full size table

Homolmapper can also use the mechanisms for mutation entry to input highlights and multi-residue motifs. In highlighting, homolmapper will label the SegID field of each residue in the structure aligned with an amino acid of a given type in a different sequence in the alignment. This permits rapid visualization of where such residues would fall on the known structure, allowing assessment of potential surface accessibility, disulfide formation, or similar properties. For multi-residue motifs, the user can list allowed amino acids at each position in the motif, and homolmapper will report the percentage of sequences in the alignment that meet all criteria to the SegID field of each atom in each residue aligned with a position in the motif. If more than one motif is specified, a 1–2 character motif identifier must be supplied to allow homolmapper to track which position belongs to which motif. This motif identifier will be written to the element field of the output PDB file.

Handling of subfamilies

Homolmapper can also compare a single subfamily to the entire alignment, reporting the results of a single scoring scheme for the subfamily (occupancy) and the entire MSA (B factor). Subfamilies can be defined by the user either explicitly (listing all members or all non-members), implicitly (by definition of amino acids that must be present or absent at certain positions for membership), by pattern-matching to sequence names (regular expressions), or by a combination of these methods. Subfamily definitions can be supplied in a text file or via the command line. It is also possible to use similar definition syntax to select a subset of the MSA for discard, permitting the user to progress from a general analysis to a more specific examination of certain sequences without having to generate a new MSA file.

In addition to permitting comparison of the subfamily to the entire MSA, homolmapper can be used to search for residues or user-defined sets that are specific to the subfamily. This is implemented by first locating positions where an allowed residue is present in a reference sequence within the subfamily, then summing the sequences within the subfamily that lack such an allowed residue. If the resulting sum is below a user-controlled tolerance, sequences outside of the subfamily that possess an allowed residue at that position are summed and compared to a second tolerance. Only positions that are below the tolerance for both tests are reported. If so requested, homolmapper can automatically search for all possible such residues and report any detected residues to SegID while leaving other outputs free for other functions. Alternately, the user can designate the occupancy and/or B-factor to search for particular residues or residue sets of interest.

This method for detecting subfamily-specific residues has the advantage that it can be applied to small subfamilies or small alignments while retaining vestigial stringency by setting the tolerances to zero. Thus, for cases in which the total sample size is too small to be statistically valid, this approach nevertheless permits rapid detection of candidate residues for further characterization, for example by site-directed mutagenesis.

Mutual information analysis

A challenge in analysis of any protein MSA is the detection of coevolving or covarying residues: while highly conserved residues are often important for protein structure or function, variable residues that interact with each other are much harder to detect yet can nonetheless be functionally or structurally important. The mutual information of pairs of positions in the MSA has been applied as a general means of analyzing such positions [35, 36]. Homolmapper can calculate mutual information for an MSA by evaluating the joint entropies of all pairs of positions in the MSA and then subtracting the joint entropy from the individual position entropies (Figure 4). The resulting raw mutual-information scores can be normalized by the joint entropy [36] or by the sum of the position entropies (to yield the redundancy). Finally, scores are converted to Z-scores for further analysis.

Reporting the results of such analysis within the confines of the PDB format is difficult. We have therefore adopted a compromise scheme whereby the maximum mutual information Z-score is reported to B-factor for every residue, regardless of what that score is (Figure 4). For residues having maximum Z-scores above a threshold, the residue number associated with that score is reported to SegID. Such residues are considered to be part of a mutually informative group, and the remaining outputs (occupancy and element) are dedicated to describing that group. The number of residues in that group is reported to element, and the sum of their residue numbers (in the numbering of the MSA) is reported to occupancy. Thus, for a mutually informative pair with residue numbers 136 and 140, element would be set to 2 and occupancy would be set to 276 (Figure 4). This scheme allows residues that are part of overlapping groups to have similar occupancy values for visualization, and a full report of significant hits is also written out to a text file. The time-consuming aspect of this analysis is the calculation of the joint-entropy matrix, but homolmapper will by default punch this matrix to a text file for reuse. It is not possible to combine mutual information analysis with other scoring choices, because this scheme uses all of the current output fields.

Using a PSSM in lieu of an MSA

Homolmapper can also take a PSSM as an input for users wishing to avoid the time involved in constructing an MSA. The PSSM parsed by homolmapper is the ASCII output PSSM generated by PSI-BLAST [17] with the "-Q" option. Scoring a PSSM will always result in the consensus sequence being reported to the element field. The SegID can be used to report the information per position or the relative weight versus pseudocounts. Occupancy and B-factor can be used for charge, degen, or sloppy scoring as in scoring an MSA, or they can be used for PSSM-specific parameters such as the PSSM value for the query sequence (Table 4).

Table 4 Scoring options for occupancy and B-factor fields with PSSM inputs

Full size table

Handling of non-canonical residues

Many structures contain residues in addition to the canonical 20 amino acids familiar to all students of biochemistry. Such residues include amino acids such as phosphoserine [37], arising from post-translational modification, and residues such as selenomethionine [38] or diiodotyrosine [39], arising from experimental manipulations. Moreover, proteins containing selenocysteine (Cse, [40, 41]) or pyrrolysine [42, 43] effectively contain "extra" genetically encoded amino acids, as do proteins incorporating unnatural amino acids via modified tRNA techniques [44]. Non-standard residues may also appear in the alignment, due to sequencing ambiguity or due to the presence of noncanonical residues such as Cse.

Homolmapper is able to handle all of these cases. By default, homolmapper will recognize five easily translated residues in PDB files and translate them to their genetically encoded equivalents: phosphoserine, phosphothreonine, phosphotyrosine, hydroxyproline, and selenomethionine are translated to Ser, Thr, Tyr, Pro, and Met, respectively. Translation of other residues in PDB files is accomplished via an extra text file or command-line argument converting the three-letter residue code(s) in the PDB file to one-letter codes found in the MSA. The use of a text file for this purpose permits the user to develop a library of non-standard residues from a number of structure files for repeated reuse.

Homolmapper can also translate non-standard residues in the MSA via a command-line flag. Such residues are again translated to one of the standard one-letter codes for matrix-based scoring schemes, although no such translation is necessary for scoring gaps, insertions, or identity. This translation is handled via the command line because there is less standardization of such codes than is the case for noncanonical residues in PDB files.

Handling of residues such as Cse essentially involves expanding the set of amino acids parsed by homolmapper. This is accomplished with a separate command-line flag expanding the parsed amino acid set in combination with a text file or command-line flag equating the PDB code for the residue in question with the new one-letter code for the "extra" amino acid. There is no limit on the number of amino acids that can be added.

User extensibility

Homolmapper offers a number of features designed to permit facile customization. Many of the more advanced features of homolmapper involve loading additional text files, so the user can readily maintain a library of frequently used accessory files, including run-settings files. As discussed, it is also possible to import similarity scoring schemes at runtime. Such schemes are normally stored as pre-formatted Python dictionaries that are loaded and compiled at runtime. The homolmapper distribution includes a small utility that generates matrices in this format from user-supplied text files describing per-residue properties, making user generation of new schemes much faster and more reliable. Homolmapper can also import these schemes as plain text files.

Homolmapper is controlled by flags specified on the command line. Such an interface normally requires the user to memorize the names of the flags that are relevant to their own work. However, homolmapper permits the user to supply synonyms that can then be redefined by the application itself, so that the user can rename flags should they find their names difficult to remember. Such redefinitions can be done on the command line or via an accessory text file. This file can then be referenced in a run-settings file for repeated reuse, permitting the user to use their own mnemonics.

Documentation

Homolmapper is distributed with examples of all the accessory input files [see Additional files 1 &2]. Each of these examples is heavily commented to aid in understanding the uses and formatting for the various files. Homolmapper also can generate considerable help information at runtime, via flags such as – help or – files. The homolmapper script itself is heavily commented, although these comments are intended to aid in programming rather than use. Several small utilities are also included in this distribution.

In addition to the documentation distributed with homolmapper itself, a User Guide is available as a separate download [see Additional file 3]. This PDF document provides demonstrations of many aspects of homolmapper operation with included structures and alignments, including both basic operations and more advanced applications such as working with expanded amino acid sets or mutual information analysis. All required files are included with this distribution.

Discussion

Planned future development

Development of homolmapper is an ongoing process. Improvements are planned in several aspects of program operation, including additional scoring schemes, acceptance of additional file formats for the MSA, handling of multiple structural inputs, and greater flexibility in handling non-standard PDB files and in scoring PSSM inputs. The open architecture of the Python implementation also facilitates distribution of user-suggested improvements to a wider user community. Techniques for comparison of multiple structures and subfamilies will eventually be incorporated as long-term improvements, and a dedicated GUI is ultimately planned.

Intended uses

Homolmapper is well suited to general visualization of homology relationships, particularly in collaborative environments where different workers are using different molecular viewing programs. It is also intended to aid in evaluation of candidate homology models, in visualizing the locations of mutations or motifs, and in comparison of a subfamily to an entire MSA, including location of subfamily-specific residues. Homolmapper is thus a useful addition to the range of software permitting visualization of homology relationships in terms of protein structure.

Conclusion

Earlier, partially functional versions of homolmapper have already proven useful for evaluating homology models [45, 46] and for examining homology and mutational information in a structural context [47, 48]. Homolmapper is designed to permit some types of user customization with minimal effort and no programming. Homolmapper is distributed under a modified BSD to permit interested users to work with the program itself as desired. It is particularly well suited to collaborations involving different molecular viewing environments. Extensive documentation is available, including many examples, and the homolmapper script itself is small and portable. We anticipate that this application will prove a useful tool for workers investigating structure and function, structural modeling, and other fields suited to evaluation of protein sequence homology in a structural context.

Availability and requirements

Project name: homolmapper
Project home page: homolmapper home page at the Lagarias lab site, URL: http://www.mcb.ucdavis.edu/faculty-labs/lagarias/homolmapper_home/homolmapper%20web%20page.htm (point-and-click license agreement with no registration; documentation with examples available at same URL). Also supplied with this manuscript as additional files.
Operating system(s): Platform independent
Programming language: Python
Other requirements: Python 2.3 or higher; standard Python modules (sys, time, and os including the os.path submodule); Windows versions require the win32 extensions to Python (sourceforge.net/projects/pywin32) for automated loading of scoring schemes.
License: based on University of California BSD
Any restrictions to use by non-academics: yes

Abbreviations

Cse:: selenocysteine
GUI:: Graphical User Interface
MSA:: multiple (protein) sequence alignment
PDB file:: Protein Data Bank file format for protein structures [19]
PSSM:: position-specific scoring matrix

References

Browne WJ, North AC, Phillips DC, Brew K, Vanaman TC, Hill RL: A possible three-dimensional structure of bovine alpha-lactalbumin based on that of hen's egg-white lysozyme. J Mol Biol 1969, 42(1):65–86. 10.1016/0022-2836(69)90487-2
Article CAS PubMed Google Scholar
Wallner B, Elofsson A: All are not equal: a benchmark of different homology modeling programs. Protein Sci 2005, 14(5):1315–1327. 10.1110/ps.041253405
Article PubMed Central CAS PubMed Google Scholar
Sayle R, Saqi M, Weir M, Lyall A: PdbAlign, PdbDist and DistAlign: tools to aid in relating sequence variability to structure. Comput Appl Biosci 1995, 11(5):571–573.
CAS PubMed Google Scholar
Guex N, Diemand A, Peitsch MC: Protein modelling for all. Trends Biochem Sci 1999, 24(9):364–367. 10.1016/S0968-0004(99)01427-9
Article CAS PubMed Google Scholar
Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera--a visualization system for exploratory research and analysis. J Comput Chem 2004, 25(13):1605–1612. 10.1002/jcc.20084
Article CAS PubMed Google Scholar
Martz E: Protein Explorer: easy yet powerful macromolecular visualization. Trends Biochem Sci 2002, 27(2):107–109. 10.1016/S0968-0004(01)02008-4
Article CAS PubMed Google Scholar
Armon A, Graur D, Ben-Tal N: ConSurf: an algorithmic tool for the identification of functional regions in proteins by surface mapping of phylogenetic information. J Mol Biol 2001, 307(1):447–463. 10.1006/jmbi.2000.4474
Article CAS PubMed Google Scholar
Glaser F, Pupko T, Paz I, Bell RE, Bechor-Shental D, Martz E, Ben-Tal N: ConSurf: identification of functional regions in proteins by surface-mapping of phylogenetic information. Bioinformatics 2003, 19(1):163–164. 10.1093/bioinformatics/19.1.163
Article CAS PubMed Google Scholar
Glaser F, Rosenberg Y, Kessel A, Pupko T, Ben-Tal N: The ConSurf-HSSP database: the mapping of evolutionary conservation among homologs onto PDB structures. Proteins 2005, 58(3):610–617. 10.1002/prot.20305
Article CAS PubMed Google Scholar
Landau M, Mayrose I, Rosenberg Y, Glaser F, Martz E, Pupko T, Ben-Tal N: ConSurf 2005: the projection of evolutionary conservation scores of residues on protein structures. Nucleic Acids Res 2005, 33(Web Server issue):W299–302. 10.1093/nar/gki370
Article PubMed Central CAS PubMed Google Scholar
Mancini AL, Higa RH, Oliveira A, Dominiquini F, Kuser PR, Yamagishi ME, Togawa RC, Neshich G: STING Contacts: a web-based application for identification and analysis of amino acid contacts within protein structure and across protein interfaces. Bioinformatics 2004, 20(13):2145–2147. 10.1093/bioinformatics/bth203
Article CAS PubMed Google Scholar
Neshich G, Mancini AL, Yamagishi ME, Kuser PR, Fileto R, Pinto IP, Palandrani JF, Krauchenco JN, Baudet C, Montagner AJ, Higa RH: STING Report: convenient web-based application for graphic and tabular presentations of protein sequence, structure and function descriptors from the STING database. Nucleic Acids Res 2005, 33(Database issue):D269–74. 10.1093/nar/gki111
Article PubMed Central CAS PubMed Google Scholar
Neshich G, Mazoni I, Oliveira SR, Yamagishi ME, Kuser-Falcao PR, Borro LC, Morita DU, Souza KR, Almeida GV, Rodrigues DN, Jardine JG, Togawa RC, Mancini AL, Higa RH, Cruz SA, Vieira FD, Santos EH, Melo RC, Santoro MM: The Star STING server: a multiplatform environment for protein structure analysis. Genet Mol Res 2006, 5(4):717–722.
PubMed Google Scholar
Neshich G, Togawa RC, Mancini AL, Kuser PR, Yamagishi ME, Pappas G Jr., Torres WV, Fonseca e Campos T, Ferreira LL, Luna FM, Oliveira AG, Miura RT, Inoue MK, Horita LG, de Souza DF, Dominiquini F, Alvaro A, Lima CS, Ogawa FO, Gomes GB, Palandrani JF, dos Santos GF, de Freitas EM, Mattiuz AR, Costa IC, de Almeida CL, Souza S, Baudet C, Higa RH: STING Millennium: A web-based suite of programs for comprehensive and simultaneous analysis of protein structure and sequence. Nucleic Acids Res 2003, 31(13):3386–3392. 10.1093/nar/gkg578
Article PubMed Central CAS PubMed Google Scholar
Roberts E, Eargle J, Wright D, Luthey-Schulten Z: MultiSeq: unifying sequence and structure data for evolutionary analysis. BMC Bioinformatics 2006, 7: 382. 10.1186/1471-2105-7-382
Article PubMed Central PubMed Google Scholar
Humphrey W, Dalke A, Schulten K: VMD: visual molecular dynamics. Journal of Molecular Graphics 1996, 14(1):33–38. 10.1016/0263-7855(96)00018-5
Article CAS PubMed Google Scholar
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
Article PubMed Central CAS PubMed Google Scholar
Python Programming Language -- Official Website[http://www.python.org]
RCSB Protein Data Bank[http://www.rcsb.org]
Fiser A, Sali A: Modeller: generation and refinement of homology-based protein structure models. Methods Enzymol 2003, 374: 461–491.
Article CAS PubMed Google Scholar
Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG: The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Res 1997, 25(24):4876–4882. 10.1093/nar/25.24.4876
Article PubMed Central CAS PubMed Google Scholar
Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673–4680. 10.1093/nar/22.22.4673
Article PubMed Central CAS PubMed Google Scholar
Shannon CE: A mathematical theory of communication. The Bell System Technical J 1948, 27: 379–423, 623–656.
Article Google Scholar
Mirny L, Shakhnovich E: Evolutionary conservation of the folding nucleus. J Mol Biol 2001, 308(2):123–129. 10.1006/jmbi.2001.4602
Article CAS PubMed Google Scholar
Thompson JD, Plewniak F, Ripp R, Thierry JC, Poch O: Towards a reliable objective function for multiple sequence alignments. J Mol Biol 2001, 314(4):937–951. 10.1006/jmbi.2001.5187
Article CAS PubMed Google Scholar
Valdar WS: Scoring residue conservation. Proteins 2002, 48(2):227–241. 10.1002/prot.10146
Article CAS PubMed Google Scholar
Davail S, Feller G, Narinx E, Gerday C: Cold adaptation of proteins. Purification, characterization, and sequence of the heat-labile subtilisin from the antarctic psychrophile Bacillus TA41. J Biol Chem 1994, 269(26):17448–17453.
CAS PubMed Google Scholar
Gerike U, Danson MJ, Russell NJ, Hough DW: Sequencing and expression of the gene encoding a cold-active citrate synthase from an Antarctic bacterium, strain DS2–3R. Eur J Biochem 1997, 248(1):49–57. 10.1111/j.1432-1033.1997.00049.x
Article CAS PubMed Google Scholar
Gudmundsdottir A: Cold-adapted and mesophilic brachyurins. Biol Chem 2002, 383(7–8):1125–1131. 10.1515/BC.2002.122
Article CAS PubMed Google Scholar
Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915
Article PubMed Central CAS PubMed Google Scholar
Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. In Atlas of Protein Sequences and Stucture. Volume 5(3). Edited by: Dayhoff MO. Washington, DC , National Biomedical Research Foundation; 1978:345–352.
Google Scholar
Benner SA, Cohen MA, Gonnet GH: Amino acid substitution during functionally constrained divergent evolution of protein sequences. Protein Eng 1994, 7(11):1323–1332. 10.1093/protein/7.11.1323
Article CAS PubMed Google Scholar
Higa RH, Montagner AJ, Togawa RC, Kuser PR, Yamagishi ME, Mancini AL, Pappas G Jr., Miura RT, Horita LG, Neshich G: ConSSeq: a web-based application for analysis of amino acid conservation based on HSSP database and within context of structure. Bioinformatics 2004, 20(12):1983–1985. 10.1093/bioinformatics/bth185
Article CAS PubMed Google Scholar
Sander C, Schneider R: The HSSP database of protein structure-sequence alignments. Nucleic Acids Res 1994, 22(17):3597–3599.
PubMed Central CAS PubMed Google Scholar
Martin LC, Gloor GB, Dunn SD, Wahl LM: Using information theory to search for co-evolving residues in proteins. Bioinformatics 2005, 21(22):4116–4124. 10.1093/bioinformatics/bti671
Article CAS PubMed Google Scholar
Gloor GB, Martin LC, Wahl LM, Dunn SD: Mutual information in protein multiple sequence alignments reveals two classes of coevolving positions. Biochemistry 2005, 44(19):7156–7165. 10.1021/bi050293e
Article CAS PubMed Google Scholar
Chacko BM, Qin BY, Tiwari A, Shi G, Lam S, Hayward LJ, De Caestecker M, Lin K: Structural basis of heteromeric smad protein assembly in TGF-beta signaling. Mol Cell 2004, 15(5):813–823. 10.1016/j.molcel.2004.07.016
Article CAS PubMed Google Scholar
Hendrickson WA, Horton JR, LeMaster DM: Selenomethionyl proteins produced for analysis by multiwavelength anomalous diffraction (MAD): a vehicle for direct determination of three-dimensional structure. Embo J 1990, 9(5):1665–1672.
PubMed Central CAS PubMed Google Scholar
Ghosh D, Erman M, Sawicki M, Lala P, Weeks DR, Li N, Pangborn W, Thiel DJ, Jornvall H, Gutierrez R, Eyzaguirre J: Determination of a protein structure by iodination: the structure of iodinated acetylxylan esterase. Acta Crystallogr D Biol Crystallogr 1999, 55(Pt 4):779–784. 10.1107/S0907444999000244
Article CAS PubMed Google Scholar
Allmang C, Krol A: Selenoprotein synthesis: UGA does not end the story. Biochimie 2006, 88(11):1561–1571. 10.1016/j.biochi.2006.04.015
Article CAS PubMed Google Scholar
Copeland PR: Regulation of gene expression by stop codon recoding: selenocysteine. Gene 2003, 312: 17–25. 10.1016/S0378-1119(03)00588-2
Article PubMed Central CAS PubMed Google Scholar
Hao B, Gong W, Ferguson TK, James CM, Krzycki JA, Chan MK: A new UAG-encoded residue in the structure of a methanogen methyltransferase. Science 2002, 296(5572):1462–1466. 10.1126/science.1069556
Article CAS PubMed Google Scholar
Srinivasan G, James CM, Krzycki JA: Pyrrolysine encoded by UAG in Archaea: charging of a UAG-decoding specialized tRNA. Science 2002, 296(5572):1459–1462. 10.1126/science.1069588
Article CAS PubMed Google Scholar
Xie J, Schultz PG: A chemical toolkit for proteins--an expanded genetic code. Nat Rev Mol Cell Biol 2006, 7(10):775–782. 10.1038/nrm2005
Article CAS PubMed Google Scholar
Fischer AJ, Rockwell NC, Jang AY, Ernst LA, Waggoner AS, Duan Y, Lei H, Lagarias JC: Multiple Roles of a Conserved GAF Domain Tyrosine Residue in Cyanobacterial and Plant Phytochromes. Biochemistry 2005, 44(46):15203–15215. 10.1021/bi051633z
Article PubMed Central CAS PubMed Google Scholar
Tu SL, Rockwell NC, Lagarias JC, Fisher AJ: Insight into the Radical Mechanism of Phycocyanobilin-Ferredoxin Oxidoreductase (PcyA) Revealed by X-ray Crystallography and Biochemical Measurements. Biochemistry 2007, 46(6):1484–1494. 10.1021/bi062038f
Article CAS PubMed Google Scholar
Rockwell NC, Lagarias JC: The structure of phytochrome: a picture is worth a thousand spectra. Plant Cell 2006, 18(1):4–14. 10.1105/tpc.105.038513
Article PubMed Central CAS PubMed Google Scholar
Rockwell NC, Su YS, Lagarias JC: Phytochrome structure and signaling mechanisms. Annu Rev Plant Biol 2006, 57: 837–858. 10.1146/annurev.arplant.56.032604.144208
Article PubMed Central CAS PubMed Google Scholar
Frankenberg N, Mukougawa K, Kohchi T, Lagarias JC: Functional genomic analysis of the HY2 family of ferredoxin-dependent bilin reductases from oxygenic photosynthetic organisms. Plant Cell 2001, 13(4):965–978. 10.1105/tpc.13.4.965
Article PubMed Central CAS PubMed Google Scholar
Hagiwara Y, Sugishima M, Takahashi Y, Fukuyama K: Induced-fitting and electrostatic potential change of PcyA upon substrate binding demonstrated by the crystal structure of the substrate-free form. FEBS Lett 2006, 580(16):3823–3828. 10.1016/j.febslet.2006.05.075
Article CAS PubMed Google Scholar
Hagiwara Y, Sugishima M, Takahashi Y, Fukuyama K: Crystal structure of phycocyanobilin:ferredoxin oxidoreductase in complex with biliverdin IXalpha, a key enzyme in the biosynthesis of phycocyanobilin. Proc Natl Acad Sci U S A 2006, 103(1):27–32. 10.1073/pnas.0507266103
Article PubMed Central CAS PubMed Google Scholar
Lee DS, Flachsova E, Bodnarova M, Demeler B, Martasek P, Raman CS: Structural basis of hereditary coproporphyria. Proc Natl Acad Sci U S A 2005, 102(40):14232–14237. 10.1073/pnas.0506557102
Article PubMed Central CAS PubMed Google Scholar
Frishman D, Argos P: Knowledge-based protein secondary structure assignment. Proteins 1995, 23(4):566–579. 10.1002/prot.340230412
Article CAS PubMed Google Scholar
Wagner JR, Brunzelle JS, Forest KT, Vierstra RD: A light-sensing knot revealed by the structure of the chromophore-binding domain of phytochrome. Nature 2005, 438(7066):325–331. 10.1038/nature04118
Article CAS PubMed Google Scholar
Sugishima M, Omata Y, Kakuta Y, Sakamoto H, Noguchi M, Fukuyama K: Crystal structure of rat heme oxygenase-1 in complex with heme. FEBS Lett 2000, 471(1):61–66. 10.1016/S0014-5793(00)01353-3
Article CAS PubMed Google Scholar
Miller S, Lesk AM, Janin J, Chothia C: The accessible surface area and stability of oligomeric proteins. Nature 1987, 328(6133):834–836. 10.1038/328834a0
Article CAS PubMed Google Scholar
Levitt M: A simplified representation of protein conformations for rapid simulation of protein folding. J Mol Biol 1976, 104(1):59–107. 10.1016/0022-2836(76)90004-8
Article CAS PubMed Google Scholar
Nozaki Y, Tanford C: The solubility of amino acids and two glycine peptides in aqueous ethanol and dioxane solutions. Establishment of a hydrophobicity scale. J Biol Chem 1971, 246(7):2211–2217.
CAS PubMed Google Scholar
Pickett SD, Sternberg MJ: Empirical scale of side-chain conformational entropy in protein folding. J Mol Biol 1993, 231(3):825–839. 10.1006/jmbi.1993.1329
Article CAS PubMed Google Scholar
Zamyatnin AA: Amino acid, peptide, and protein volume in solution. Annu Rev Biophys Bioeng 1984, 13: 145–165. 10.1146/annurev.bb.13.060184.001045
Article CAS PubMed Google Scholar
McCaldon P, Argos P: Oligopeptide biases in protein sequences and their use in predicting protein coding regions in nucleotide sequences. Proteins 1988, 4(2):99–122. 10.1002/prot.340040204
Article CAS PubMed Google Scholar

Download references

Acknowledgements

The authors wish to thank Yong Duan and Hongxing Lei for insightful collaboration on a homology modeling project that spurred development of homolmapper and Shih-Long Tu and Keenan Taylor for its evaluation. We also wish to thank the reviewers for their helpful suggestions. This work was supported by a grant from the National Institutes of Health (GM068552) to J.C.L. and by a subcontract from the National Science Foundation Center for Biophotonics Science and Technology PHY-0120999.

Author information

Authors and Affiliations

Section of Molecular and Cellular Biology, University of California, Davis, California, 95616, USA
Nathan C Rockwell & J Clark Lagarias

Authors

Nathan C Rockwell
View author publications
You can also search for this author in PubMed Google Scholar
J Clark Lagarias
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Nathan C Rockwell.

Additional information

Authors' contributions

NCR developed the software and wrote the documentation. JCL aided in development and in drafting the manuscript. Both authors read and approved the final manuscript.

Electronic supplementary material

12859_2007_1495_MOESM1_ESM.gz

Additional file 1: The standard homolmapper distribution (current as of March 2007), containing homolmapper itself and associated utilities and accessory files. (GZ 129 KB)

12859_2007_1495_MOESM2_ESM.gz

Additional file 2: Alternative homolmapper distribution for operation as a batch file under Windows, current as of March 2007. (GZ 131 KB)

12859_2007_1495_MOESM3_ESM.gz

Additional file 3: Additional documentation for homolmapper, including the PDF User Guide and example structure, alignment, and accessory files, current as of March 2007. (GZ 5 MB)

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Rockwell, N.C., Lagarias, J.C. Flexible mapping of homology onto structure with Homolmapper. BMC Bioinformatics 8, 123 (2007). https://doi.org/10.1186/1471-2105-8-123

Download citation

Received: 19 January 2007
Accepted: 11 April 2007
Published: 11 April 2007
DOI: https://doi.org/10.1186/1471-2105-8-123

Flexible mapping of homology onto structure with Homolmapper

Abstract

Background

Results

Conclusion

Similar content being viewed by others

Background

Implementation and results

Rationale for program operation

Overall architecture

Matching the structure to the alignment

Scoring schemes and reference sequences

Implementation of mutations, residue highlights, and multi-residue motifs

Handling of subfamilies

Mutual information analysis

Using a PSSM in lieu of an MSA

Handling of non-canonical residues

User extensibility

Documentation

Discussion

Planned future development

Intended uses

Conclusion

Availability and requirements

Abbreviations

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Authors' contributions

Electronic supplementary material

Authors’ original submitted files for images

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation