Biological Assembly Comparison with VAST+

Madej, Thomas; Marchler-Bauer, Aron; Lanczycki, Christopher; Zhang, Dachuan; Bryant, Stephen H.

doi:10.1007/978-1-0716-0270-6_13

Thomas Madej³,
Aron Marchler-Bauer³,
Christopher Lanczycki³,
Dachuan Zhang³ &
…
Stephen H. Bryant³

Part of the book series: Methods in Molecular Biology ((MIMB,volume 2112))

3826 Accesses
7 Citations

Abstract

The VAST+ algorithm is an efficient, simple, and elegant solution to the problem of comparing the atomic structures of biological assemblies. Given two protein assemblies, it takes as input all the pairwise structural alignments of the component proteins. It then clusters the rotation matrices from the pairwise superpositions, with the clusters corresponding to subsets of the two assemblies that may be aligned and well superposed. It uses the Vector Alignment Search Tool (VAST) protein–protein comparison method for the input structural alignments, but other methods could be used, as well. From a chosen cluster, an “original” alignment for the assembly may be defined by simply combining the relevant input alignments. However, it is often useful to reduce/trim the original alignment, using a Monte Carlo refinement algorithm, which allows biologically relevant conformational differences to be more readily detected and observed. The method is easily extended to include RNA or DNA molecules. VAST+ results may be accessed via the URL https://www.ncbi.nlm.nih.gov/Structure, then entering a PDB accession or terms in the search box, and using the link [VAST+] in the upper right corner of the Structure Summary page.

You have full access to this open access chapter, Download protocol PDF

US-align: universal structure alignments of proteins, nucleic acids, and macromolecular complexes

Article 29 August 2022

Predicting the structure of large protein complexes using AlphaFold and Monte Carlo tree search

Article Open access 12 October 2022

Annotating Macromolecular Complexes in the Protein Data Bank: Improving the FAIRness of Structure Data

Article Open access 01 December 2023

Key words

1 Introduction

Quite some years ago, in the earlier days of structural bioinformatics, among the hot topics were the prediction of protein three-dimensional (3D) structure, exploration of the protein fold universe, and understanding the molecular evolution of proteins. For all these studies, the recognition of distant homologues or analogous folds via protein-to-protein structural similarity is necessary. Efficient and useful protein-to-protein structure comparison methods were developed, such as DALI, SSAP, Vector Alignment Search Tool (VAST), CE, MATRAS, TM-align, and others [1,2,3,4,5,6]. Of course, it was also clear that the comparison of biological assemblies of proteins (and including other molecule types) to one another is important, although the range of complexity of available biological assemblies was limited in the early days of the Protein Data Bank (PDB). Comparison of assemblies is necessary for the study of molecular interactions and interfaces in atomic-level detail. Good computing performance is highly desirable, as the public structure database PDB right now contains nearly 150,000 structures and continues to grow.

Although it may seem that the comparison of biological molecular assemblies is not much more complicated than the comparison of two molecules, there is an extra degree of complexity introduced because assemblies can be very large, including several thousand or more residues/nucleotides, much larger than size of typical protein–protein comparisons. The fact that the molecules cannot be ordered in any canonical way is another complication. The MM-align algorithm (MultiMer-align [7]) solves these problems by using dynamic programming to achieve good performance and handles the comparison of assemblies by essentially comparing many pairs of large individual “artificial” molecules. The ordering problem is taken care of by considering every possible order of the protein chains, which results in many large individual and “artificial” proteins. The SCPC (Structural Comparison of Protein Complexes [8]) method detects similarities between substructures using secondary structure elements (SSEs). The individual protein chains are compared, using the SSE decompositions. This gives a collection of similar pairs, which are then further agglomerated into larger similar substructures, using a scoring function that cross-checks SSE positions across the constituent pairs. There is also 3D Complex, which represents assemblies by graphs and uses a graph matching algorithm to assess similarity between assemblies [9], producing a hierarchical classification of complexes.

In this chapter, we describe VAST+, a biological assembly comparison algorithm [10] that uses our original VAST [3] protein–protein comparison method. As will be apparent, there is no strict dependence on the VAST algorithm per se; any other pairwise-molecule structure comparison method could be used to provide the needed input to the VAST+ program. Besides the identification of similarities between biological assemblies, we have also made an effort to detect and emphasize meaningful dissimilarities, as described in the methods section about “refined alignment.” By considering dissimilarities between assemblies, one can more easily visualize state transitions and important conformational changes. In general, there will be many dissimilarities between assemblies that differ in many details, and automatically detecting and annotating those that may be biologically significant is a major challenge.

2 Methods

2.1 Clustering by Rotation Matrix Distance

Here is the intuition behind the method. In Fig. 1, there are two “assemblies,” ABC (blue) and a similar copy ABC (red). If we do a superposition of the A’s only, then we translate the center of mass (centroid) of the red-A to the centroid of the blue-A and rotate, R1. The amount of rotation is given by the angle between two lines: one line is determined by the centroid of the blue-A and the centroid of the blue assembly, c1; the other line goes through the centroid of the red-A and the centroid of the red assembly, c2.

To superpose the two assemblies, i.e., red-ABC and blue-ABC simultaneously, we translate c2 to c1 and then rotate (R2); again, the degree of rotation is determined by the angle between the translated lines. A translation does not change the angle between lines, and therefore we see that R1 must be equal to R2.

Of course, this is harder to visualize in 3D, but the same argument applies. The three superpositions of red-A and blue-A, red-B and blue-B, and red-C and blue-C represent the comparisons between single protein chains. The superposition of red-ABC to blue-ABC corresponds to the comparison of two molecular assemblies.

There is only one problem that arises, which is shown in Fig. 2. In this situation, we see that the rotation matrix for the A’s, after translation, is the identity matrix and likewise for the B’s. But the “interface” between A and B in the two assemblies is different! To handle this inconvenience, we simply introduce another numerical restriction, the “orientation check.” Namely, in order to cluster rot(A, A′) and rot(B, B′) together, not only do we require that the Euclidean distance between rot(A, A′) and rot(B, B′) be small enough, but also the vector from the centroid of A to the centroid of B needs to be in roughly the same direction as the corresponding vector between A′ and B′, after the translation and rotation. In the example in Fig. 2, these vectors point in opposite directions.

To determine a reasonable clustering threshold, we took random pairs of arbitrarily chosen rotation matrices obtained from structure comparisons and calculated the Euclidean distance between them. The mean distance was about 2.4, and only about 1.7% of the pairs had a distance less than 1.0. We chose 1.0 as a threshold; this is the simple Euclidean distance in nine dimensions and there are no units.

In outline, the algorithm is thus as follows. Given two protein assemblies to compare, first compute all pairwise structural alignments between the component protein chains. For each pair of similar proteins, there is a rotation matrix for the superposition. The rotation matrices for the pairs are also going to include, with minor deviations, the rotations that superpose entire similar subassemblies, by the preceding argument. Cluster the rotation matrices using complete-linkage clustering, with Euclidean distance between the matrices (i.e., distance <1.0), and using the “orientation check.” The resulting clusters correspond to mappings between protein chains in one assembly and protein chains in the other assembly. An overall alignment for the assemblies can be obtained by simply combining the pairwise structural alignments that we started with for the component chains. In the next section, we will refer to this as an “original alignment” of the assemblies. We can get a superposition of the assemblies via the original alignment. Each of the clusters gives a different alignment; from among these various possibilities, choose one according to whatever desirable criteria. It makes the most sense to consider the clusters with the most protein chains aligned (i.e., the largest clusters), and then further select among these, e.g., by largest total number of residues aligned or smallest root mean square deviation (RMSD; all RMSDs referred to are superposition RMSDs, i.e., obtained after optimal 3D superposition).

Consider the case of comparing two hemoglobin tetramers. Each has four chains, ABCD and A′B′C′D′. The chains are all pairwise comparable, so there are rotation matrices rot(A, A′), rot(A, B′), rot(A, C′), etc., which are 16 in all. After the clustering, we get four clusters of size 4, and these correspond exactly to all the possible alignments of the tetramers. Some of these will map the alpha chains to alpha and beta to beta, and some will map alpha to beta and beta to alpha. The ones mapping alpha to alpha and beta to beta will have slightly larger alignments with a better RMSD, and one will be chosen as the representative assembly alignment, correctly.

If two biological assemblies include nucleotide chains, with structural alignments computed between them, then there is no problem in applying the rotation matrix clustering to include the alignments between the nucleotide chains. Thus, the method is easily extended to more complicated molecular assemblies such as ribosomes.

Figure 3 displays the number of similar assemblies in the VAST+ database, in terms of dimer-dimer, trimer-trimer, etc., on up through 24-mer to 24-mer similarities.

2.2 Refined Alignment

As you can guess, the assembly superposition obtained by using the pairwise alignments/superpositions of the component molecules, i.e., the “original” alignment, amounts to an average superposition. In many, or even most, cases, this is satisfactory, e.g., for closely related structures, the RMSD of the assembly superposition may be 1.0 Å or less, which is better than the resolution of almost all structures in the database. However, in some cases, the average superposition obscures our understanding of the assembly similarity. This will be apparent in the examples given below.

The purpose of the “refined alignment” algorithm is to process the original alignment by trimming it, in order to more easily view any biologically relevant differences between the assemblies. The trimmed pieces, which deviate from a “common core” of the two assemblies, are the positions where the conformations differ to a greater degree. Here is a brief description of the refined alignment algorithm. We formulate it as an optimization problem, where the scoring function is

$$ f(S)=N-E $$

Here S is the set of atoms (C-alphas, really paired/aligned atoms) that are included in the current alignment. The current alignment is a subset of the original alignment. Then N is simply the size of S, and E is the error term. We define the error term E as a sum of indexed error terms E(i, j):

$$ E=\sum (S)\ E\left(i,j\right) $$

The “∑(S)” means “sum over all positions i, j in S,” and E(i, j) = Δ(i, j) − T + 1 if Δ(i, j) > T, where T is the tolerance. The deltas are given by

$$ \Delta \left(i,j\right)=\mid d\left(i,j\right)-d\left({i}^{\prime },{j}^{\prime}\right)\mid $$

where d(i, j) = (Euclidean) distance between atoms i and j in structure 1, d(i′, j′) = distance between i′ and j′ in structure 2, and i is aligned with i′, j with j′. The deltas are measuring the difference in distance between equivalent pairs of positions in the two structures. It works well to choose the tolerance T to be the RMSD of the original alignment/superposition. If we have a current alignment S and add a residue position to it, then N will increase by 1, but the error term E will also increase, depending on how many deltas associated with the position are bigger than the original RMSD and also by how much. We can see that there is going to be a tendency to shrink the original alignment by removing “erroneous” residues.

The scoring function f(S) = N − E is like those appearing in the classical, hard optimization problems. When trying to maximize it, there is a tension between choosing atoms to add to S to increase N and choosing to increase the error term, E. If we set the error term, E(i, j) = infinity whenever Δ(i, j) > T, then the problem amounts to finding a maximum independent set (MIS) on a type of graph embedded in three dimensions. The MIS on general graphs is a classical NP-complete problem, and there is no known efficient algorithm to solve it. It is not at all obvious if the MIS for general graphs can be transformed to the problem involving our special class of graphs, so it is unclear whether our problem is NP-complete or not. Nonetheless, a reasonable approach to solving our problem is to use a heuristic. We use a Monte Carlo-type algorithm, a Gibbs sampler. To further simplify the problem, we distinguish the SSEs belonging to the structures. Conveniently, VAST computes the protein-to-protein alignments using SSEs. Corresponding to each aligned SSE, there is a contiguous segment associated with it, in each of the two structures, which may extend past the endpoints of the SSE and into adjacent loop regions. A “move” in the MC algorithm consists of replacing a segment by a different one. When we replace a segment in one structure, we also replace the corresponding segment in the other structure, so that the replacement is consistent with the original alignment. Notice that we can delete a segment simply by replacing it with the empty segment. The error term for replacing a segment is simply the sum of the error terms for the positions that are added, minus the sum of the error terms for the positions removed. Then the Gibbs sampler proceeds in the standard way: choose a segment at random, calculate the scores for all the possible replacements of that segment, assign each possible replacement the Boltzmann weight, and then choose the replacement segment probabilistically according to the weights [11].

3 Examples

An excellent example is provided by comparison between the R (relaxed) and T (tense) states of aspartate transcarbamoylase, e.g., PDB structures 4kh1 and 1rae. 4kh1 is the R-state structure of an ATCase from E. coli [12], whereas 1rae is a T-state CTP-ligated ATCase also from E. coli [13]. The ATCase assembly consists of two catalytic trimers and three regulatory dimers, for a total of 12 protein chains [14]. Comparing the individual protein chains pairwise, between the 4kh1 and 1rae structures, we see that the pairs with similar folds align very well and superpose with excellent RMSDs of under 1.5 Å.

By clustering the rotation matrices and simply combining all the alignments of the 12 component protein chains, we get an original alignment of the entire assembly involving 2613 residues with a 5.4 Å RMSD. Structure 3D graphical viewers such as iCn3D https://github.com/ncbi/icn3d may have an “alternate structure” feature to flip between the structures in the superposed state. When viewed in iCn3D at a good angle, by flipping between the structures, the relative motion of the more rigid halves of the assemblies is readily seen.

However, the refined alignment makes the differences between the T- and R-states even more apparent. For example, in hand, 4kh1 and 1rae, the refined alignment has 1113 aligned residues at a RMSD of 2.3 Å. The most striking difference between the T-state and the R-state structures is an expansion of about 11 Å along the threefold axis [14]. The refined alignment redefines the superposition relative to one of the catalytic trimers and includes only six chains, namely, a catalytic trimer and one chain from each of the three regulatory dimers. This helps to make the large-scale difference between the T- and R-states more obvious (see Fig. 4). When displayed using the iCn3D graphical viewer, alternating the structures with the “a” key now shows the lower part moving vertically relative to the fixed upper part. See Fig. 5 for an overview of how to access the VAST+ webserver using this particular example.

The refinement algorithm can also detect much finer-grained conformational differences. The hemagglutinin influenza virus example involves PDB structures 3sdy and 1mqm. The 3sdy structure is a broadly neutralizing antibody bound to the influenza A H3 hemagglutinin [15]. The 1mqm structure is the hemagglutinin for a potential avian progenitor of the 1968 Hong Kong pandemic influenza virus [16]. All of the protein-to-protein superpositions between the hemagglutinin molecules are excellent, complete chain alignments at under 1.0 Å RMSD. When we combine all these to get our original alignment, we have 1470 residues aligned with an excellent 1.5 Å RMSD. The refined alignment is trimmed to 865 residues at 0.6 Å RMSD. As in the aspartate transcarbamoylase example, we see that regions involving larger-scale movements become unaligned, i.e., are trimmed (see Fig. 6). Again in Fig. 6, the red parts are the aligned and superposed refined alignment, and the white parts are unaligned in the refinement. The head of the hemagglutinin assembly, which is that part which opens upon membrane fusion with the cell the virus is trying to infect, is not included in the refined alignment because there is a conformational difference. Note that the white chain traces at the top and right-hand side, and in the background, are the antibodies in the 3sdy structure, which are not present in 1mqm and hence not included in the original alignment. However, besides the change at the head of the hemagglutinins, finer-grained details are also detected, such as the three B-loops. The B-loops change conformation in order to deliver the fusion peptide [15]. This is a good example, where the protein-to-protein alignments are excellent and align complete chains, but by comparing the assemblies and using the refinement method, we can detect subtle but important dissimilarities that are biologically relevant.

4 Summary

The VAST+ algorithm is an efficient, simple, and elegant solution to the problem of comparing the atomic structures of biological assemblies. It uses the VAST protein–protein comparison method for the underlying structural alignments, but other methods could be used, as well. The original VAST was designed to detect similarity between protein folds and was not very concerned with subtle differences in the structures. By using VAST+ and comparing entire biological assemblies, we can automatically detect important and subtle biologically relevant differences in the structures. Moreover, as examples like the influenza hemagglutinin show, it is only by comparing entire assemblies that we can detect and clarify relationships between the structures.

References

Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233(1):123–138
Article CAS Google Scholar
Orengo CA, Taylor WR (1996) SSAP: sequential structure alignment program for protein structure comparison. Methods Enzymol 266:617–635
Article CAS Google Scholar
Gibrat JF, Madej T, Bryant SH (1996) Surprising similarities in structure comparison. Curr Opin Struct Biol 6(3):377–385
Article CAS Google Scholar
Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11(9):739–747
Article CAS Google Scholar
Kawabata T, Nishikawa K (2000) Protein structure comparison using the Markov transition model of evolution. Proteins 41(1):108–122
Article CAS Google Scholar
Zhang Y, Skolnick J (2005) TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Res 33(7):2302–2309
Article CAS Google Scholar
Mukherjee S, Zhang Y (2009) MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res 37(11):e83
Article Google Scholar
Koike R, Ota M (2012) SCPC: a method to structurally compare protein complexes. Bioinformatics 28(3):324–330
Article CAS Google Scholar
Levy ED, Pereira-Leal JB, Chothia C, Teichmann SA (2006) 3D complex: a structural classification of protein complexes. PLoS Comput Biol 2(11):e155
Article Google Scholar
Madej T, Lanczycki CJ, Zhang D, Thiessen PA, Geer RC, Marchler-Bauer A, Bryant SH (2014) MMDB and VAST+: tracking structural similarities between macromolecular complexes. Nucleic Acids Res 42(Database issue):D297–D303
Article CAS Google Scholar
Tanner MA (1998) Tools for statistical inference: methods for the exploration of posterior distributions and likelihood functions, Springer series in statistics. Springer-Verlag, New York
Google Scholar
Cockrell GM, Zheng Y, Guo W, Peterson AW, Truong JK, Kantrowitz ER (2013) New paradigm for allosteric regulation of Escherichia coli aspartate transcarbamoylase. Biochemistry 52(45):8036–8047
Article CAS Google Scholar
Kosman RP, Gouaux JE, Lipscomb WN (1993) Crystal structure of CTP-ligated T state aspartate transcarbamoylase at 2.5 A resolution: implications for ATCase mutants and the mechanism of negative cooperativity. Proteins 15(2):147–176
Article CAS Google Scholar
Lipscomb WN, Kantrowitz ER (2012) Structure and mechanisms of Escherichia coli aspartate transcarbamoylase. Acc Chem Res 45(3):444–453
Article CAS Google Scholar
Ekiert DC, Friesen RH, Bhabha G, Kwaks T, Jongeneelen M, Yu W, Ophorst C, Cox CF, Korse HJ, Brandenburg B, Vogels R, Brakenhoff JP, Kompier R, Koldijk MH, Cornelissen LA, Poon LL, Peiris M, Koudstaal W, Wilson IA, Goudsmit J (2011) A highly conserved neutralizing epitope on group 2 influenza A viruses. Science 333(6044):843–850
Article CAS Google Scholar
Ha Y, Stevens DJ, Skehel JJ, Wiley DC (2003) X-ray structure of the hemagglutinin of a potential H3 avian progenitor of the 1968 Hong Kong pandemic influenza virus. Virology 309(2):209–218
Article CAS Google Scholar

Download references

Acknowledgments

This research was supported by the Intramural Research Program of the NIH, National Library of Medicine at National Institutes of Health/DHHS. Funding for open access charge: Intramural Research Program of the National Library of Medicine, National Institutes of Health.

Author information

Authors and Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
Thomas Madej, Aron Marchler-Bauer, Christopher Lanczycki, Dachuan Zhang & Stephen H. Bryant

Authors

Thomas Madej
View author publications
You can also search for this author in PubMed Google Scholar
Aron Marchler-Bauer
View author publications
You can also search for this author in PubMed Google Scholar
Christopher Lanczycki
View author publications
You can also search for this author in PubMed Google Scholar
Dachuan Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Stephen H. Bryant
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Thomas Madej .

Editor information

Editors and Affiliations

Information Technology and Bionics, Pázmány Péter Catholic University, Budapest, Hungary
Zoltán Gáspári

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Reprints and permissions

Copyright information

About this protocol

Cite this protocol

Madej, T., Marchler-Bauer, A., Lanczycki, C., Zhang, D., Bryant, S.H. (2020). Biological Assembly Comparison with VAST+. In: Gáspári, Z. (eds) Structural Bioinformatics. Methods in Molecular Biology, vol 2112. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-0270-6_13

Download citation

DOI: https://doi.org/10.1007/978-1-0716-0270-6_13
Published: 01 February 2020
Publisher Name: Humana, New York, NY
Print ISBN: 978-1-0716-0269-0
Online ISBN: 978-1-0716-0270-6
eBook Packages: Springer Protocols

Publish with us

Policies and ethics