Key words

1 Introduction

Quite some years ago, in the earlier days of structural bioinformatics, among the hot topics were the prediction of protein three-dimensional (3D) structure, exploration of the protein fold universe, and understanding the molecular evolution of proteins. For all these studies, the recognition of distant homologues or analogous folds via protein-to-protein structural similarity is necessary. Efficient and useful protein-to-protein structure comparison methods were developed, such as DALI, SSAP, Vector Alignment Search Tool (VAST), CE, MATRAS, TM-align, and others [1,2,3,4,5,6]. Of course, it was also clear that the comparison of biological assemblies of proteins (and including other molecule types) to one another is important, although the range of complexity of available biological assemblies was limited in the early days of the Protein Data Bank (PDB). Comparison of assemblies is necessary for the study of molecular interactions and interfaces in atomic-level detail. Good computing performance is highly desirable, as the public structure database PDB right now contains nearly 150,000 structures and continues to grow.

Although it may seem that the comparison of biological molecular assemblies is not much more complicated than the comparison of two molecules, there is an extra degree of complexity introduced because assemblies can be very large, including several thousand or more residues/nucleotides, much larger than size of typical protein–protein comparisons. The fact that the molecules cannot be ordered in any canonical way is another complication. The MM-align algorithm (MultiMer-align [7]) solves these problems by using dynamic programming to achieve good performance and handles the comparison of assemblies by essentially comparing many pairs of large individual “artificial” molecules. The ordering problem is taken care of by considering every possible order of the protein chains, which results in many large individual and “artificial” proteins. The SCPC (Structural Comparison of Protein Complexes [8]) method detects similarities between substructures using secondary structure elements (SSEs). The individual protein chains are compared, using the SSE decompositions. This gives a collection of similar pairs, which are then further agglomerated into larger similar substructures, using a scoring function that cross-checks SSE positions across the constituent pairs. There is also 3D Complex, which represents assemblies by graphs and uses a graph matching algorithm to assess similarity between assemblies [9], producing a hierarchical classification of complexes.

In this chapter, we describe VAST+, a biological assembly comparison algorithm [10] that uses our original VAST [3] protein–protein comparison method. As will be apparent, there is no strict dependence on the VAST algorithm per se; any other pairwise-molecule structure comparison method could be used to provide the needed input to the VAST+ program. Besides the identification of similarities between biological assemblies, we have also made an effort to detect and emphasize meaningful dissimilarities, as described in the methods section about “refined alignment.” By considering dissimilarities between assemblies, one can more easily visualize state transitions and important conformational changes. In general, there will be many dissimilarities between assemblies that differ in many details, and automatically detecting and annotating those that may be biologically significant is a major challenge.

2 Methods

2.1 Clustering by Rotation Matrix Distance

Here is the intuition behind the method. In Fig. 1, there are two “assemblies,” ABC (blue) and a similar copy ABC (red). If we do a superposition of the A’s only, then we translate the center of mass (centroid) of the red-A to the centroid of the blue-A and rotate, R1. The amount of rotation is given by the angle between two lines: one line is determined by the centroid of the blue-A and the centroid of the blue assembly, c1; the other line goes through the centroid of the red-A and the centroid of the red assembly, c2.

Fig. 1
figure 1

The angle between the lines gives the rotation required to superpose both the red-A and blue-A and also the entire “assemblies” red-ABC and blue-ABC

To superpose the two assemblies, i.e., red-ABC and blue-ABC simultaneously, we translate c2 to c1 and then rotate (R2); again, the degree of rotation is determined by the angle between the translated lines. A translation does not change the angle between lines, and therefore we see that R1 must be equal to R2.

Of course, this is harder to visualize in 3D, but the same argument applies. The three superpositions of red-A and blue-A, red-B and blue-B, and red-C and blue-C represent the comparisons between single protein chains. The superposition of red-ABC to blue-ABC corresponds to the comparison of two molecular assemblies.

There is only one problem that arises, which is shown in Fig. 2. In this situation, we see that the rotation matrix for the A’s, after translation, is the identity matrix and likewise for the B’s. But the “interface” between A and B in the two assemblies is different! To handle this inconvenience, we simply introduce another numerical restriction, the “orientation check.” Namely, in order to cluster rot(A, A′) and rot(B, B′) together, not only do we require that the Euclidean distance between rot(A, A′) and rot(B, B′) be small enough, but also the vector from the centroid of A to the centroid of B needs to be in roughly the same direction as the corresponding vector between A′ and B′, after the translation and rotation. In the example in Fig. 2, these vectors point in opposite directions.

Fig. 2
figure 2

The rotations to superpose red-A with blue-A and red-B with blue-B are both the identity rotation, but red-AB and blue-AB cannot be superposed together because of a difference in orientation

To determine a reasonable clustering threshold, we took random pairs of arbitrarily chosen rotation matrices obtained from structure comparisons and calculated the Euclidean distance between them. The mean distance was about 2.4, and only about 1.7% of the pairs had a distance less than 1.0. We chose 1.0 as a threshold; this is the simple Euclidean distance in nine dimensions and there are no units.

In outline, the algorithm is thus as follows. Given two protein assemblies to compare, first compute all pairwise structural alignments between the component protein chains. For each pair of similar proteins, there is a rotation matrix for the superposition. The rotation matrices for the pairs are also going to include, with minor deviations, the rotations that superpose entire similar subassemblies, by the preceding argument. Cluster the rotation matrices using complete-linkage clustering, with Euclidean distance between the matrices (i.e., distance <1.0), and using the “orientation check.” The resulting clusters correspond to mappings between protein chains in one assembly and protein chains in the other assembly. An overall alignment for the assemblies can be obtained by simply combining the pairwise structural alignments that we started with for the component chains. In the next section, we will refer to this as an “original alignment” of the assemblies. We can get a superposition of the assemblies via the original alignment. Each of the clusters gives a different alignment; from among these various possibilities, choose one according to whatever desirable criteria. It makes the most sense to consider the clusters with the most protein chains aligned (i.e., the largest clusters), and then further select among these, e.g., by largest total number of residues aligned or smallest root mean square deviation (RMSD; all RMSDs referred to are superposition RMSDs, i.e., obtained after optimal 3D superposition).

Consider the case of comparing two hemoglobin tetramers. Each has four chains, ABCD and A′B′C′D′. The chains are all pairwise comparable, so there are rotation matrices rot(A, A′), rot(A, B′), rot(A, C′), etc., which are 16 in all. After the clustering, we get four clusters of size 4, and these correspond exactly to all the possible alignments of the tetramers. Some of these will map the alpha chains to alpha and beta to beta, and some will map alpha to beta and beta to alpha. The ones mapping alpha to alpha and beta to beta will have slightly larger alignments with a better RMSD, and one will be chosen as the representative assembly alignment, correctly.

If two biological assemblies include nucleotide chains, with structural alignments computed between them, then there is no problem in applying the rotation matrix clustering to include the alignments between the nucleotide chains. Thus, the method is easily extended to more complicated molecular assemblies such as ribosomes.

Figure 3 displays the number of similar assemblies in the VAST+ database, in terms of dimer-dimer, trimer-trimer, etc., on up through 24-mer to 24-mer similarities.

Fig. 3
figure 3

Counts of similar assemblies in the VAST+ database. The x-axis contains the number of molecules in the complexes from dimers to 24-mers. The y-axis is on a logarithmic scale, because the dimer-dimer, trimer-trimer, and tetramer-tetramer counts are so dominant

2.2 Refined Alignment

As you can guess, the assembly superposition obtained by using the pairwise alignments/superpositions of the component molecules, i.e., the “original” alignment, amounts to an average superposition. In many, or even most, cases, this is satisfactory, e.g., for closely related structures, the RMSD of the assembly superposition may be 1.0 Å or less, which is better than the resolution of almost all structures in the database. However, in some cases, the average superposition obscures our understanding of the assembly similarity. This will be apparent in the examples given below.

The purpose of the “refined alignment” algorithm is to process the original alignment by trimming it, in order to more easily view any biologically relevant differences between the assemblies. The trimmed pieces, which deviate from a “common core” of the two assemblies, are the positions where the conformations differ to a greater degree. Here is a brief description of the refined alignment algorithm. We formulate it as an optimization problem, where the scoring function is

$$ f(S)=N-E $$

Here S is the set of atoms (C-alphas, really paired/aligned atoms) that are included in the current alignment. The current alignment is a subset of the original alignment. Then N is simply the size of S, and E is the error term. We define the error term E as a sum of indexed error terms E(i, j):

$$ E=\sum (S)\ E\left(i,j\right) $$

The “∑(S)” means “sum over all positions i, j in S,” and E(i, j) = Δ(ij) − T + 1 if Δ(ij) > T, where T is the tolerance. The deltas are given by

$$ \Delta \left(i,j\right)=\mid d\left(i,j\right)-d\left({i}^{\prime },{j}^{\prime}\right)\mid $$

where d(ij) = (Euclidean) distance between atoms i and j in structure 1, d(i′, j′) = distance between i′ and j′ in structure 2, and i is aligned with i′, j with j′. The deltas are measuring the difference in distance between equivalent pairs of positions in the two structures. It works well to choose the tolerance T to be the RMSD of the original alignment/superposition. If we have a current alignment S and add a residue position to it, then N will increase by 1, but the error term E will also increase, depending on how many deltas associated with the position are bigger than the original RMSD and also by how much. We can see that there is going to be a tendency to shrink the original alignment by removing “erroneous” residues.

The scoring function f(S) = N − E is like those appearing in the classical, hard optimization problems. When trying to maximize it, there is a tension between choosing atoms to add to S to increase N and choosing to increase the error term, E. If we set the error term, E(ij) = infinity whenever Δ(ij) > T, then the problem amounts to finding a maximum independent set (MIS) on a type of graph embedded in three dimensions. The MIS on general graphs is a classical NP-complete problem, and there is no known efficient algorithm to solve it. It is not at all obvious if the MIS for general graphs can be transformed to the problem involving our special class of graphs, so it is unclear whether our problem is NP-complete or not. Nonetheless, a reasonable approach to solving our problem is to use a heuristic. We use a Monte Carlo-type algorithm, a Gibbs sampler. To further simplify the problem, we distinguish the SSEs belonging to the structures. Conveniently, VAST computes the protein-to-protein alignments using SSEs. Corresponding to each aligned SSE, there is a contiguous segment associated with it, in each of the two structures, which may extend past the endpoints of the SSE and into adjacent loop regions. A “move” in the MC algorithm consists of replacing a segment by a different one. When we replace a segment in one structure, we also replace the corresponding segment in the other structure, so that the replacement is consistent with the original alignment. Notice that we can delete a segment simply by replacing it with the empty segment. The error term for replacing a segment is simply the sum of the error terms for the positions that are added, minus the sum of the error terms for the positions removed. Then the Gibbs sampler proceeds in the standard way: choose a segment at random, calculate the scores for all the possible replacements of that segment, assign each possible replacement the Boltzmann weight, and then choose the replacement segment probabilistically according to the weights [11].

3 Examples

An excellent example is provided by comparison between the R (relaxed) and T (tense) states of aspartate transcarbamoylase, e.g., PDB structures 4kh1 and 1rae. 4kh1 is the R-state structure of an ATCase from E. coli [12], whereas 1rae is a T-state CTP-ligated ATCase also from E. coli [13]. The ATCase assembly consists of two catalytic trimers and three regulatory dimers, for a total of 12 protein chains [14]. Comparing the individual protein chains pairwise, between the 4kh1 and 1rae structures, we see that the pairs with similar folds align very well and superpose with excellent RMSDs of under 1.5 Å.

By clustering the rotation matrices and simply combining all the alignments of the 12 component protein chains, we get an original alignment of the entire assembly involving 2613 residues with a 5.4 Å RMSD. Structure 3D graphical viewers such as iCn3D https://github.com/ncbi/icn3d may have an “alternate structure” feature to flip between the structures in the superposed state. When viewed in iCn3D at a good angle, by flipping between the structures, the relative motion of the more rigid halves of the assemblies is readily seen.

However, the refined alignment makes the differences between the T- and R-states even more apparent. For example, in hand, 4kh1 and 1rae, the refined alignment has 1113 aligned residues at a RMSD of 2.3 Å. The most striking difference between the T-state and the R-state structures is an expansion of about 11 Å along the threefold axis [14]. The refined alignment redefines the superposition relative to one of the catalytic trimers and includes only six chains, namely, a catalytic trimer and one chain from each of the three regulatory dimers. This helps to make the large-scale difference between the T- and R-states more obvious (see Fig. 4). When displayed using the iCn3D graphical viewer, alternating the structures with the “a” key now shows the lower part moving vertically relative to the fixed upper part. See Fig. 5 for an overview of how to access the VAST+ webserver using this particular example.

Fig. 4
figure 4

Refined alignment/superposition of aspartate transcarbamoylase T- and R-state structures (PDB accessions 1rae and 4kh1). The red part displays the six-chain subcomponents that are superposed, which includes one catalytic trimer and one chain from each of the three regulatory dimers

Fig. 5
figure 5

From the “VAST+ Similar Structures” webpage, one can enter the PDB accession for a structure of interest (topmost red circle). This goes to the main VAST+ page, which will present a list of similar assemblies in the bottom panel. There is a search box which can be used to filter the results using simple search terms, e.g., keywords appearing in the titles or taxonomy. In this example, we are specifically interested in the PDB accession 1rae and so enter it in the search box (middle red circle). When we expand the entry for that result, we see the list of protein chains from each structure (4kh1 and 1rae), and the ones that correspond in the precomputed superposition are highlighted in the same color. An interactive graphical view can be obtained by launching the iCn3D viewer (bottom red circle)

The refinement algorithm can also detect much finer-grained conformational differences. The hemagglutinin influenza virus example involves PDB structures 3sdy and 1mqm. The 3sdy structure is a broadly neutralizing antibody bound to the influenza A H3 hemagglutinin [15]. The 1mqm structure is the hemagglutinin for a potential avian progenitor of the 1968 Hong Kong pandemic influenza virus [16]. All of the protein-to-protein superpositions between the hemagglutinin molecules are excellent, complete chain alignments at under 1.0 Å RMSD. When we combine all these to get our original alignment, we have 1470 residues aligned with an excellent 1.5 Å RMSD. The refined alignment is trimmed to 865 residues at 0.6 Å RMSD. As in the aspartate transcarbamoylase example, we see that regions involving larger-scale movements become unaligned, i.e., are trimmed (see Fig. 6). Again in Fig. 6, the red parts are the aligned and superposed refined alignment, and the white parts are unaligned in the refinement. The head of the hemagglutinin assembly, which is that part which opens upon membrane fusion with the cell the virus is trying to infect, is not included in the refined alignment because there is a conformational difference. Note that the white chain traces at the top and right-hand side, and in the background, are the antibodies in the 3sdy structure, which are not present in 1mqm and hence not included in the original alignment. However, besides the change at the head of the hemagglutinins, finer-grained details are also detected, such as the three B-loops. The B-loops change conformation in order to deliver the fusion peptide [15]. This is a good example, where the protein-to-protein alignments are excellent and align complete chains, but by comparing the assemblies and using the refinement method, we can detect subtle but important dissimilarities that are biologically relevant.

Fig. 6
figure 6

Refined alignment of influenza virus hemagglutinin assemblies; PDB accessions 3sdy and 1mqm. One of the B-loops is highlighted in green. The head and B-loop annotations were added manually

4 Summary

The VAST+ algorithm is an efficient, simple, and elegant solution to the problem of comparing the atomic structures of biological assemblies. It uses the VAST protein–protein comparison method for the underlying structural alignments, but other methods could be used, as well. The original VAST was designed to detect similarity between protein folds and was not very concerned with subtle differences in the structures. By using VAST+ and comparing entire biological assemblies, we can automatically detect important and subtle biologically relevant differences in the structures. Moreover, as examples like the influenza hemagglutinin show, it is only by comparing entire assemblies that we can detect and clarify relationships between the structures.