Protein Interfacial Pocket Engineering via Coupled Computational Filtering and Biological Focusing Criterion
- Cite this article as:
- Reza, F., Zuo, P. & Tian, J. Ann Biomed Eng (2007) 35: 1026. doi:10.1007/s10439-007-9316-8
To engineer bio-macromolecular systems, protein–substrate interactions and their configurations need to be understood, harnessed, and utilized. Due to the inherently large number of combinatorial configurations and the conformational complexity, methods that rely on heuristics or stochastics, such as practical computational filtering (CF) or biological focusing (BF) criteria, rarely yield insights into these complexes or successes in (re)designing them when used alone. Here we apply a coupled CF–BF criterion to an amenable interfacial pocket (IP) of a protein scaffold complexed with its substrate. The IP undergoes residue replacement and R-group refinement (R4) to filter out energetically unfavorable residues and R-group conformations and to focus on those that are evolutionarily favorable. We show that this coupled filtering and focusing can efficiently provide a putative engineered IP candidate, and we validate it computationally and empirically. The CF–BF criterion may permit holistic understanding of the nuances of existing protein IPs and their scaffolds and facilitate bioengineering efforts to alter substrate specificity. Such an approach may contribute to accelerated elucidation of engineering principles of bio-macromolecular systems.
Keywords: Protein scaffold, Interfacial pocket, DNA substrate, Complex, Residue replacement, R-group refinement, Computational bioengineering, Synthetic systems biology
GMEC: global minimum energy conformation
R4: residue replacement R-group refinement
RMSD: root mean square deviation
VDW: van der Waals
Computational Filtering (CF) Approaches
There are a number of CF approaches to perform R4 with the aforementioned energetic conditions under consideration. However, due to the rapidly increasing degrees of freedom at each residue, n, of the protein chain, coupled with the specific characteristics of the 20 amino acids that can be found at each position, a colossal combinatorial quagmire of 20^n possibilities requires modeling and analysis. For an average-sized protein composed of 100 amino acids, the 20^100 possible combinations exceed the number of atoms in the known universe. Thus, for the protein’s IP to locate its native state by pursuing all these combinations is biologically infeasible (known as the Levinthal paradox)15 and computationally impractical (known as the Blind Watchmaker paradox).15 An exhaustive structural bioinformatics search for IP formation and end-state continues to be a challenge that is tackled using filtering, heuristics, homology, distributed computing, and high performance supercomputers with varied success.39
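The scale of this sequence space can be checked with exact integer arithmetic. The sketch below is illustrative only; the figure of 10^80 atoms is a commonly quoted order-of-magnitude estimate and is an assumption, not a value taken from this article.

```python
# Size of the sequence space for a chain of n residues,
# with 20 possible amino acids at each position.
AMINO_ACIDS = 20
ATOMS_IN_UNIVERSE = 10 ** 80  # assumption: common order-of-magnitude estimate


def sequence_space(n_residues: int, alphabet: int = AMINO_ACIDS) -> int:
    """Return the number of distinct sequences of length n_residues."""
    return alphabet ** n_residues


def order_of_magnitude(value: int) -> int:
    """Return floor(log10(value)) for a positive integer."""
    return len(str(value)) - 1


space_100 = sequence_space(100)
print(order_of_magnitude(space_100))   # 130, i.e. 20^100 is about 10^130
print(space_100 > ATOMS_IN_UNIVERSE)   # True
```

Python integers are arbitrary precision, so the count is exact rather than a floating-point approximation.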
Heuristics are often helpful and necessary in undertaking R4 at the scale of IPs. For example, heuristics in genetic algorithms, mean field algorithms, constraint logic programming enumeration, or database search perform adequately under certain scenarios and assumptions and not as well under others.11 While the computational cost is lessened or efficiency increased compared to the exhaustive search, the quality of the end solution may be inconsistent, and there is no assurance that the particular IP R4 generated by the heuristic is located at or near the GMEC.
Homology can often aid in proper R4 as well. Here, informatics searches and interpolations from signature sequences of a few residues composing a key motif of IP, substrate, or both can provide clues for engineering. This can extend further to domain sampling of entire regions across the protein that compose the IP. While this may be effective in well-investigated and documented systems, those sequences or structures with no similarity or availability of such information can hinder this approach. Even with fertile sources, often the R4 is limited by what has been already observed to transpose well.26,27
In similar fashion, partitioning and docking can narrow the possibilities for R4.25 A collection of IP conformations can be generated, each presenting a different VDW profile, electrostatic profile, or desolvation cost. By docking this collection of IP conformations to the proper substrate, the affinity features of the subpartition of IPs that docks more readily can be gleaned. However, fully enumerating all the elements in this collection may be computationally difficult or biologically unsubstantiated.
Biological Focusing (BF) Approaches
Correspondingly, there are many BF approaches to perform R4 so that resulting possibilities are in or near the aforementioned energetic conditions, perhaps by virtue of the constraints and fitness requirements existing in and imposed by the biological environment.44,45 Here, the parallel processing nature of this environment may provide a natural, even advantageous, platform to evaluate the large combinatorial number of possibilities and interdependencies to be considered in a tractable manner. However, this evaluation is often performed in a stochastic, discovery-driven investigation using various mutagenesis techniques, recombination, and directed evolution among others to screen for high performing clones or select those that survive from a large starting population representing the number of possibilities.7
Stochastics are often necessary for R4 at a single position in the protein, let alone the half-dozen to a dozen residues that comprise some IPs.19 Consider a random mutagenesis methodology using mutagenic chemicals, wobble base PCR, or error prone PCR to incorporate mutations at the genetic level that will be selected or screened for the desired characteristics at the protein level. Though apparently misguided, it has been observed that non-obvious mutations can give rise to proteins with new characteristics.3
Another group of approaches to achieve R4 using biology relies on using the recombination of existing components in the system to generate new promising possibilities.43 Among these is incremental truncation to correlate the loss or gain of certain IP features and functions to the gene and protein truncation positions.31,38 There is also homologous gene shuffling to generate variants of the original IP from internal wellsprings of diversity.9
These external and internal sources of stochastics can be considered aspects of directed and simulated evolution, which mimic the fitness requirements, survival and natural selection, and propagation and amplification of individuals, or IPs, to evaluate massive potential-filled populations with desirable properties.19 However, these stochastic approaches rely on the robustness of this evolutionary condition to propagate order from randomness. In summary, the BF approaches are usually a compromise between the intended end IP and those that arise serendipitously or survive having unintended properties.
Coupling Computational and Biological Approaches: CF–BF
Materials and methods
Comparative protein candidate structures for R.PvuII
PvuII Native apo form restriction enzyme: Dimer (shown) with 157 residues per monomer
PvuII restriction enzyme with DNA substrate: As above with DNA base pairs
PvuII cytosine N-4 apo form Methyltransferase: Monomer (shown) with 323 residues
EcoRI Native apo form restriction enzyme: Dimer with 276 residues per monomer (shown)
EcoRI restriction enzyme with DNA substrate: As above with DNA base pairs
EcoRV Native apo form restriction enzyme: Dimer (shown) with 245 residues per monomer
EcoRV restriction enzyme with DNA substrate: As above with DNA base pairs
BamHI Native apo form restriction enzyme: Dimer (shown) with 213 residues per monomer
BamHI restriction enzyme with DNA substrate: As above with DNA base pairs
Primary structure (PS) BF analysis was performed by first querying the REBASE database for comparative protein candidates (1PVI.pdb, 1BOO.pdb, 1ERI.pdb, 4RVE.pdb, 1BHM.pdb) and extracting the source organism. For each protein and associated source organism, the identity of the corresponding oligonucleotide bases of DNA substrate and means of interaction was determined. The protein polypeptide sequences were retrieved from the Protein Data Bank for each .pdb entry and cross-checked with the SEQRES fields in the .pdb files. Furthermore, these sequences were used to construct a phylogenetic tree. Multiple sequence alignments of the sequences were generated using a pairwise alignment evolutionary distance matrix, neighbor-join clustering, and CLUSTALW algorithms.40
Secondary structure (SS) BF analysis was performed for the remaining proteins (1PVI.pdb, 1ERI.pdb, 4RVE.pdb, 1BHM.pdb) after PS BF by querying the “Sequence Details” section of the Protein Data Bank to assign SS based on the .pdb structure file’s “Author” assignment, and domain assignment using the Structural Classification of Proteins (SCOP) backend database.28 The hydropathic profiles of each protein were determined using the Kyte–Doolittle method.22
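A hydropathic profile of the Kyte–Doolittle kind is a sliding-window average of per-residue hydropathy values. The sketch below is a minimal illustration using the published Kyte–Doolittle scale; the window length of 9 and the example sequence are assumptions chosen for illustration, since the window used in this study is not stated.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 9) -> list:
    """Return the mean hydropathy of every full window along the sequence.

    window=9 is an assumed default, not a value taken from the article.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KD_SCALE[aa] for aa in seq]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(seq) - window + 1)
    ]


# Hypothetical 15-residue sequence, used only to exercise the function.
toy_sequence = "MKILLAVDDERSTGG"
profile = hydropathy_profile(toy_sequence)
print(len(profile))  # one value per full window: 15 - 9 + 1 = 7
```

Positive values indicate locally hydrophobic stretches and negative values hydrophilic ones, which is what allows profiles of different proteins to be compared position by position.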
Tertiary (TS) and quaternary (QS) structure BF analyses were performed for the remaining proteins (1PVI.pdb and 4RVE.pdb) after PS and SS BF using the open source PyMOL v.0.99 molecular graphics real-time visualization and manipulation software with embedded Python scripting and interpreter.10 Each .pdb file was loaded and preset to cartoon rendering of the polypeptide SS, TS, and QS, with main and side chain rendering of the oligonucleotide substrate enabled and the polypeptide chains colored directionally using a spectral gamut ranging from cooler blue hues at the N-termini to warmer red hues at the C-termini. After isolating all atoms composing the hexameric recognition sequence on the sense oligonucleotide chain to serve as points of origin, the set of all residues within a 3.0 angstrom (Å) boundary of these points was selected; 3.0 Å is the average distance between hydrogen bond donor and acceptor atoms. This set was pruned of those atoms located at the origin and on the antisense oligonucleotide chain, leaving the subset of atoms that were part of IP residues and R-groups within this boundary. Upon labeling, the polypeptide positions and the residues at those positions were tabulated against the closest oligonucleotide base. This process was repeated for the antisense oligonucleotide chain with similar results, due to the palindromic nature of the DNA substrates and the homodimeric nature of the restriction endonucleases.
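The distance-based selection described above can be sketched in plain Python, independent of PyMOL. This is an illustrative reimplementation under stated assumptions, not the script used in the study: the function name, the tuple layout, and the toy coordinates and residue numbers are all hypothetical.

```python
import math


def residues_within(origin_atoms, candidate_atoms, cutoff=3.0):
    """Return residues having at least one atom within cutoff of any origin atom.

    origin_atoms: list of (x, y, z) coordinates, e.g. recognition-site DNA atoms.
    candidate_atoms: list of (chain, residue_number, residue_name, (x, y, z)).
    cutoff: distance in angstroms (3.0 follows the boundary used in the text).
    """
    selected = set()
    for chain, resnum, resname, xyz in candidate_atoms:
        if any(math.dist(xyz, origin) <= cutoff for origin in origin_atoms):
            selected.add((chain, resnum, resname))
    return sorted(selected)


# Toy coordinates (hypothetical, for illustration only).
dna_atoms = [(0.0, 0.0, 0.0), (3.4, 0.0, 0.0)]
protein_atoms = [
    ("A", 1, "ASP", (0.0, 2.8, 0.0)),     # 2.8 A from the first DNA atom
    ("A", 1, "ASP", (0.0, 5.0, 0.0)),     # same residue, atom out of range
    ("A", 2, "HIS", (3.4, 0.0, 2.9)),     # 2.9 A from the second DNA atom
    ("A", 3, "LEU", (10.0, 10.0, 10.0)),  # far from both DNA atoms
]

ip_residues = residues_within(dna_atoms, protein_atoms)
print(ip_residues)  # [('A', 1, 'ASP'), ('A', 2, 'HIS')]
```

Passing only protein atoms as candidates plays the role of the pruning step, since origin atoms and antisense-chain atoms never enter the candidate list.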
In addition to these steric VDW calculations, qualitative vacuum electrostatics assessments of the protein IP and accessible surface for each structure were generated using a local protein contact potential, without solvent dielectrics, and with equilibrium charges and radii settings from the Assisted Model Building and Energy Refinement (AMBER 99) force field to evaluate IP charge complementarity to the substrate DNA.16 Given these sterics and electrostatics, the consensus participating positions from the acceptor IP were overlaid with the consensus participating residues from the donor IP to propose a putative engineered IP on the acceptor scaffold that binds the donor substrate.
Preliminary validation of the proposed putative engineered IP was carried out both computationally and empirically. Structural mutagenesis was performed on both monomers of R.PvuII-D using a PyMOL-native rotamer library to generate the mutant homodimeric enzyme. Discrete mutant rotamers were auto-positioned based on calculated lowest energy, steric hindrance minimizing conformations, and then relaxed to lie along the same spatial direction as would be achieved by a natural, continuous R-group. Then, in one approach, GATATC DNA substrate coordinates were extracted from R.EcoRV-D, and in the other, B-form DNA substrate coordinates were generated de novo using NUCGEN.5 Each GATATC substrate was then inserted into R.PvuII-D and affine space aligned along the CAGCTG DNA substrate using the shared second A and fifth T bases as spatial and directional coordinates of reference. Upon aligning, the CAGCTG DNA substrate was deleted, resulting in the R.PvuII Putative Engineered IP mutant-D, i.e., the mutant complexed with GATATC DNA substrate. Hydrogen bond and polar contact patterns between the IP residues and DNA substrate involved in recognition were calculated and compared for engineered and wild-type complexes.
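The sequence-level facts that this alignment relies on are easy to verify: both recognition sites are palindromic, and they share bases only at the second and fifth positions. The short sketch below checks this; the helper names are hypothetical and the code is illustrative rather than part of the original workflow.

```python
# Translation table mapping each DNA base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(site: str) -> str:
    """Return the reverse complement of a DNA sequence (5' to 3')."""
    return site.translate(COMPLEMENT)[::-1]


def is_palindromic(site: str) -> bool:
    """A site is palindromic if it equals its own reverse complement."""
    return site == reverse_complement(site)


def shared_positions(site_a: str, site_b: str) -> list:
    """Return 1-based positions at which two equal-length sites share a base."""
    return [
        i for i, (a, b) in enumerate(zip(site_a, site_b), start=1) if a == b
    ]


pvuii_site = "CAGCTG"  # acceptor substrate
ecorv_site = "GATATC"  # donor substrate

print(is_palindromic(pvuii_site), is_palindromic(ecorv_site))  # True True
print(shared_positions(pvuii_site, ecorv_site))                # [2, 5]
```

The two shared positions correspond to the second A and fifth T used as reference coordinates in the alignment.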
Preliminary empirical validation was carried out using a cell survival assay. The R.PvuII Putative Engineered IP mutant gene was synthesized de novo using a protocol similar to that described previously,41 and the correct synthetic sequence was selected and confirmed by standard DNA sequencing technology. This synthetic gene was sub-cloned into the pET-21a expression vector (Novagen), which was then transformed into E. coli BL21 cells. Transformed cells were cultured with appropriate selection antibiotics in the presence or absence of IPTG, which induced expression of the R.PvuII Putative Engineered IP mutant.
Results and discussions
Primary Structure (PS) BF
Origins and attributes of chosen proteins
R.PvuII: 5′-CAG ↓ CTG-3′ / 3′-GTC ↑ GAC-5′
R.EcoRI, source E. coli RY13: 5′-G ↓ AATTC-3′ / 3′-CTTAA ↑ G-5′
R.EcoRV, source E. coli J62 pLG74: 5′-GAT ↓ ATC-3′ / 3′-CTA ↑ TAG-5′
R.BamHI, source H (ATCC 49763): 5′-G ↓ GATCC-3′ / 3′-CCTAG ↑ G-5′
Secondary Structure (SS) and Hydropathy BF
The objective of BF on SS is to focus on those remaining candidates that have similar local hydropathic profiles and conformations of the IP polypeptide backbone, such as the alpha helix and beta sheet. In addition, given evolutionary fold conservation at protein IPs, such as active and allosteric sites, it may be worthwhile to compare how these local conformations interact with the DNA substrate. This conservation may also influence the choices made in R4, since certain residues are more capable of participating in a particular SS because they are able to adopt the necessary backbone dihedral angles. A mapping of these dihedral angles to the corresponding SS and capable residues can be found in a Ramachandran plot.
For this analysis, hydropathic profiles remained uninformative, but similarity in SS motif interactions to DNA grooves permitted further focusing (Supplementary Fig. 1). It was calculated that both R.EcoRI-N/D and R.BamHI-N/D tend to have a greater proportion and longer stretches of alpha helices (shown as waves) than beta sheets (shown as arrows), while both R.EcoRV-N/D and R.PvuII-N/D are more balanced in beta sheet content. Notably, both R.EcoRI-N/D and R.BamHI-N/D approach and recognize the DNA substrate from the major groove via an alpha helix and a loop and produce 5′ sticky ends, while both R.EcoRV-N/D and R.PvuII-N/D do so from the minor groove via a beta sheet and beta-like turn and produce blunt or 3′ sticky ends. Thus, this SS BF deemphasizes R.EcoRI-N/D and R.BamHI-N/D as sources for IP donation to R.PvuII-N/D, leaving R.EcoRV-N/D as a more biologically promising candidate.
Tertiary (TS) and Quaternary (QS) Structure BF
The objective of BF on TS and QS is to spatially align the remaining two well-focused candidates and readily identify the IP residues close enough to interact with the substrate. Given that both R.PvuII-N and R.EcoRV-N recognize and bind to a uniform, helical substrate, this can facilitate R4 by acting as a common coordinate reference from which residues from the donor IP can be mapped onto positions on the acceptor IP. A promising mapping can be confirmed using qualitative vacuum electrostatics assessments of the protein IP and accessible surface, since the engineered IP should have the electrostatic profile of the donor IP while the rest of the scaffold should remain as is.
Computational Validation via Structural Mutagenesis
Preliminary Empirical Validation via Cell Survival Assay
A cell survival assay was performed to determine whether the engineered R.PvuII has any enzymatic activity in cutting DNA. A synthetic R.PvuII Putative Engineered IP mutant was cloned into the pET-21a expression vector and cultured on agar plates in the presence or absence of IPTG. Without IPTG, the cells grew normally and formed colonies; with IPTG, cells did not grow and no colonies were found on the plate (data not shown). The result of this cell survival assay suggested that the expression of the R.PvuII Putative Engineered IP mutant in E. coli led to cell death, presumably due to digestion of the host chromosomal DNA.4 Further characterization of the mutant enzyme is underway to determine its specific activity and substrate specificity.
CF–BF produced a promising putative engineered IP, and the computational and empirical validations confirmed its properties to a certain degree. The computational validation of the R.PvuII-D scaffold with the engineered IP examined the hydrogen bonding and polar contact patterns to show that they, like those of the R.EcoRV-D IP, interact with all the bases in the substrate GATATC DNA. Preliminary empirical validation using the cell survival assay suggested that the R.PvuII Putative Engineered IP has enzymatic activity in cutting DNA. The engineered specificity remains to be determined by further biochemical assays. These validations, in turn, will inform and improve the CF–BF criterion through further iterations of this process for these types of nucleic acid binding enzymes.
The IP engineering possibilities are nearly as vast as the diversity of biology itself. Consider R4 of a hexameric DNA recognition site and a monomer of a dimeric restriction enzyme using CF alone. Assuming that one residue interacts with each base of DNA and that a limited number of rotamers in a library represents all possible R-group conformations of the 20 naturally occurring residues, this would have required the evaluation of a number of possible IP sequences equal to the size of the rotamer library raised to the sixth power. The CF–BF reduces this to a subset of the 20 naturally occurring residues (the 10 polar and charged residues in the restriction endonuclease experiment herein) and R-group conformations, which are known to participate in the IP and interact with a similarly structured substrate. The benefits of successful IP engineering are equally numerous. Engineered IPs may lead to programmable proteins, such as restriction endonucleases that not only act as research tools, by enabling targeting, mapping and manipulation of genes and genomes, but as clinical technologies as well, by facilitating cleaving out a disease gene and repairing it with a working version in vivo.
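The size of this reduction can be estimated with a short calculation. The sketch below is a rough illustration only: it assumes, hypothetically, the same number of rotamers for every residue type (10 here), which is not a figure from this study, and it counts only the restriction of residue types, not the additional pruning of R-group conformations.

```python
# Hypothetical uniform rotamer count per residue type (assumption for illustration).
ROTAMERS_PER_TYPE = 10


def ip_search_space(n_residue_types: int,
                    rotamers_per_type: int = ROTAMERS_PER_TYPE,
                    n_positions: int = 6) -> int:
    """Return (rotamer library size) ** n_positions.

    The library size is n_residue_types * rotamers_per_type, and
    n_positions=6 reflects one IP residue per base of a hexameric site.
    """
    library_size = n_residue_types * rotamers_per_type
    return library_size ** n_positions


cf_only = ip_search_space(20)   # all 20 naturally occurring residues
cf_bf = ip_search_space(10)     # the 10 polar and charged residues

print(cf_only)           # 64000000000000
print(cf_bf)             # 1000000000000
print(cf_only // cf_bf)  # 64
```

Under the uniform-rotamer assumption, halving the residue alphabet alone shrinks the space by a factor of 2^6 = 64 regardless of the rotamer count; restricting conformations as well would reduce it further.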
This work was supported by a Biotechnology Predoctoral Training Fellowship to F.R. from NIH Grant GM08555 and partially by the Arnold and Mabel Beckman Foundation Young Investigator Award to J.T. Further support to F.R. was provided by the Duke University Institute for Genome Sciences and Policy, Computational Biology and Bioinformatics Program.