SNP2SIM: a modular workflow for standardizing molecular simulation and functional analysis of protein variants
Molecular simulations are used to provide insight into protein structure and dynamics, and have the potential to provide important context when predicting the impact of sequence variation on protein function. In addition to understanding molecular mechanisms and interactions on the atomic scale, translational applications of those approaches include drug screening, development of novel molecular therapies, and targeted treatment planning. Supporting the continued development of these applications, we have developed the SNP2SIM workflow that generates reproducible molecular dynamics and molecular docking simulations for downstream functional variant analysis. The Python workflow utilizes molecular dynamics software (NAMD (Phillips et al., J Comput Chem 26(16):1781-802, 2005), VMD (Humphrey et al., J Mol Graph 14(1):33-8, 27-8, 1996)) to generate variant specific scaffolds for simulated small molecule docking (AutoDock Vina (Trott and Olson, J Comput Chem 31(2):455-61, 2010)).
SNP2SIM is composed of three independent modules that can be used sequentially to generate the variant scaffolds of missense protein variants from the wildtype protein structure. The workflow first generates the mutant structure and configuration files required to execute molecular dynamics simulations of solvated protein variant structures. The resulting trajectories are clustered based on the structural diversity of residues involved in ligand binding to produce one or more variant scaffolds of the protein structure. Finally, these unique structural conformations are bound to small molecule ligand libraries to predict variant induced changes to drug binding relative to the wildtype protein structure.
SNP2SIM provides a platform to apply molecular simulation based functional analysis of sequence variation in the protein targets of small molecule therapies. In addition to simplifying the simulation of variant specific drug interactions, the workflow enables large scale computational mutagenesis by controlling the parameterization of molecular simulations across multiple users or distributed computing infrastructures. This enables the parallelization of the computationally intensive molecular simulations to be aggregated for downstream functional analysis, and facilitates comparing various simulation options, such as the specific residues used to define structural variant clusters. The Python scripts that implement the SNP2SIM workflow are available (SNP2SIM Repository. https://github.com/mccoymd/SNP2SIM, Accessed 2019 February ), and individual SNP2SIM modules are available as apps on the Seven Bridges Cancer Genomics Cloud (Lau et al., Cancer Res 77(21):e3-e6, 2017; Cancer Genomics Cloud [www.cancergenomicscloud.org; Accessed 2018 November]).
KeywordsMolecular dynamics Ligand docking Protein structure Functional prediction
Nanoscale Molecular Dynamics
Programmed death ligand 1
Root mean square deviation
Visual Molecular Dynamics
Molecular simulation is a powerful tool used by computational biologists to analyze the relationship between protein structure and its functional properties. Ranging from high throughput drug screening to focused characterization of protein conformational dynamics, the creative analysis has several translational applications. Large libraries of drug candidates can be evaluated to produce novel targeted therapeutics, and insight into specific molecular interactions between effective drugs and their protein targets aids the design novel molecules [1, 2]. An advantage of the computational simulations is the ability to probe how variation in the protein sequence alters those molecular interactions, and can be extended to the development of therapies targeted at specific sequence variants [3, 4, 5, 6]. In addition to drug discovery and design, the insight can be further extended to inform treatment planning when selecting an optimal targeted therapeutic strategy .
Due to an inherent tradeoff between resolution and computational requirements, molecular simulations can be divided between approaches which only simulate a fraction of the overall molecule and those which explicitly consider all atomic interactions occurring within a solvated system. Coarse grained methods which do not explicitly consider the internal interactions occurring within the protein backbone are used to address the enormous search space that must be sampled when predicting how two molecules interact . For example, predicting how well a small molecule ligand will bind to a target protein depends on the sum total of all the individual atomic interactions. Depending on the chemical nature of the ligand, the conformational diversity can be quite large due to rotation around individual bonds and limited steric constraints of a single ligand molecule. Furthermore, the protein surface represents a large area of potential interactions and exponentially increases the degrees of freedom which must be explored when identifying an optimally bound structure. In order to simplify the search for optimized protein:ligand conformations and to simulate high throughput binding of large libraries of low molecular weight ligands, coarse grained docking methods will typically only model the flexibility of the ligand and a small number of interacting protein residues within a defined area of a rigid protein structure .
While the liberties taken by these types of simulations allow for a greater throughput, they fail to account for internal protein dynamics which may play a significant role in the interacting complex. All-atom molecular dynamics (MD) simulations explicitly account for atomic interactions occurring within a molecular system and provide a way to understand the overall conformational flexibility and structural dynamics . However, even systems consisting of a small, solvated protein contain tens to hundreds of thousands of atoms and each simulation step requires a summation of all the forces acting on each. Even on high performance computational infrastructures, simulation runs can easily last weeks to generate usable results. The increased computing cost is offset by its unique insight and characterization of functionally relevant protein dynamics.
Both approaches find utility in specific applications, and their individual strengths are leveraged to understand the impact on protein sequence variation on small molecule binding. Upon mutation of a residue, the change in the amino acid side chain has the potential to alter the functional interactions with a small molecule. If the change occurs within the defined search space of a coarse grained binding simulation, the new interactions can be simulated directly. Typically, the structures used for binding simulations are derived from x-ray crystallography, but simply swapping out amino acid side chains in the intersecting residues may not fully account for the structural differences of the protein variant. Since the protein backbone is treated as a rigid scaffold, the predicted binding characteristics do not account for those subtle changes in the backbone geometry and could have a large influence on the results. Furthermore, these methods have nothing to offer if the variation occurs outside of the defined search space, especially those amino acids which are buried within the folded protein structure. MD simulations can address this limitation by comprehensively sampling the conformational landscape of a protein variant to generate characteristic scaffolds for downstream small molecule docking.
At its core, SNP2SIM is a modular set of simulation and analysis tools wrapped in a command line Python script. There are many molecular dynamics simulations packages available, and the backend of the SNP2SIM workflow is designed to easily incorporate additional simulation packages in the future to customize the workflow and better accommodate user preferences. This initial implementation is built around Nanoscale Molecular Dynamics (NAMD)  and Visual Molecular Dynamics (VMD)  due to their scalability, interoperability, and implementation across a wide range of high performance computing infrastructures and operating systems. VMD is also used to process the results of the NAMD simulations and cluster the resulting trajectories according to structural variation in the protein:ligand binding interface. A representative conformation from each cluster is chosen to create a set of variant specific protein structures that reflect the subtle changes to its conformational diversity. AutoDock Vina  is used to perform the small molecule docking, and was selected due to its widespread use, ease of implementation within the workflow, and computational performance.
Starting with only a PDB formatted file of the protein structure, three independently run functional modules perform the molecular dynamics simulation of a protein variant, cluster of the resulting trajectories based on conformational variation in user defined binding residues, and dock small molecule ligands into each variant specific structural scaffolds. The workflow is designed to be used as a tool to aid large scale computational mutagenesis studies, enabling uniform application of simulation and analysis parameters. SNP2SIM minimizes the simulation options exposed to the user to control the generation of tool specific preprocessing and analysis scripts, define the parametrization options used in the configuration files, and output simulation results into a predefined file structure. The standardized file structure and naming conventions provide the option to implement the modules across independent computational systems and easily aggregate the results for downstream analysis.
The command line implementation of SNP2SIM is available for download from a GitHub repository , and the varMDsim, varScaffold, and drugSearch modules are also available as apps on the Seven Bridges Cancer Genomics Cloud [14, 15]. Due to the nature of MD simulations, the computational requirements of the workflow are dependent on the overall size of the protein structure and can grow to become quite significant, even on high performance infrastructures.
With the minimal input of a PDB formatted protein structure file and simulation time in nanoseconds, the varMDsim module will generate a solvated, ionized water box around a mutated protein structure, create the configuration files for the all-atom, explicit solvent simulation with periodic boundary conditions, and compile the results for downstream analysis. Utilizing the VMD Mutator, Solvate, and Autoionize plugins, the workflow will automatically mutate the input structure prior to solvation. The CHARMM36 force field  is used to parameterize the protein structure, and water molecules use the TIP3P water model. The simulation configuration files are hardcoded into the workflow, standardizing the resulting simulation for reuse and promoting the reproducibility of the computational simulations.
The run length of simulations is highly dependent on the nature of the protein under study, and can become significant for highly dynamic or large structures. However, since the aim is to capture subtle, variant induced changes to the conformational dynamics of the ligand binding interface, the structural diversity should be sufficiently sampled after hundreds of nanoseconds. Since SNP2SIM is configured to run the version of NAMD (including those utilize GPUs) installed on the user system, the varMDsim module can first be applied to benchmark performance.
The simulation trajectories are analyzed using the varScaffold module to produce characteristic structures of protein variants. More than merely clustering the collection of protein structures from the MD simulations, varScaffold first aligns the entire set to a common reference frame (typically over the entire protein structure) before measuring the root mean square deviation (RMSD) in the backbone of a subset of amino acids involved in ligand binding. Using the VMD “measure cluster” command, where a user supplied RMSD threshold is used to identify the 5 most populated configurations of the binding residue geometry. If one of the clusters is assigned a significant portion of the overall population of simulated results, a representative structure is chosen as a variant scaffold for downstream ligand binding.
The varScaffold module will accept multiple PDB or DCD formatted trajectory files generated through parallel execution of the varMDsim module. Since the clusters are determined using a relatively small number of residues, the number of populated clusters is very sensitive to the RMSD threshold. The workflow enables the iterative application of clustering parameters, allowing the user to specify which binding residues are used to define the binding interface geometry and determine the optimal RMSD cutoff before applying the module to the entire variant population.
The drugSearch module uses AutoDock Vina  to bind a library of low molecular weight molecules into the variant scaffolds. Unlike the previous modules which are largely automated, the configuration of the drugSearch module requires the user to define the ligand binding site on a reference structure. This requires the user to supply a PDB formatted protein structure (typically the structure used to initiate the varMDsim module), and an associated parameter file that defines the coordinates and dimensions of the search space. Additionally, the user can specify a set of residues within that search space model with flexible sidechains. These search parameters can be determined using the AutoDockTools software package, which accompanies the AutoDock Vina distribution.
The drugSearch module streamlines the process of ligand screening by aligning the individual variant scaffolds to the reference coordinates, generates the AutoDock Vina structural input and associated configuration files, and sequentially predicts the binding interactions and energies for individual ligands in the specified drug library. Several large libraries of ligands from The National Cancer Institute Developmental Therapeutics Program (Diversity Set 5, Mechanistic Set 3, and Natural Products Set 4) are included in the SNP2SIM repository, and additional libraries can be easily incorporated. The drugSearch module outputs the coordinates and binding energies for the top 9 high affinity poses for each small molecule.
The immunomodulatory protein programmed death ligand 1 (PD-L1) was used to demonstrate a typical application of the SNP2SIM workflow to drug development in immunotherapy. In some cancers, overexpression of PD-L1 leads to inactivation of the immune cells that attack the tumor, leading to the development of small molecule inhibitors that selectively inhibit PD-L1 interactions [17, 18, 19, 20]. To understand how these molecules may differentially bind to variants of PD-L1, known mutations in the binding domain were processed through the SNP2SIM workflow. The initial starting structure used the Ig-like V-type domain from PDB: 4Z18, and 5, 100 ns simulations were generated for a set of protein variant found in common experimental cell lines, as well as those most commonly occurring across all cancer types (L53P, V68 L, L94 M, G95R, A97V, M115 T) . Variant trajectories were aligned using the entire domain backbone and clusters were defined using a 0.7 Angstrom RMSD cluster threshold for the backbone atoms in residues interacting with low molecular weight inhibitors in PDB crystal structures [17, 18, 19, 20] (Residues 19, 20 54, 56, 66, 68, 115, 116, 117, 121, 122, 123, 124, 125). These same interacting residues were also modeled with flexible side chain torsions. The SNP2SIM workflow was run using the Seven Bridges Cancer Genomics Cloud infrastructure [14, 15], and the files needed to run this example are provided in the SNP2SIM code repository .
The growing prevalence of genomic testing is revealing an enormous amount of rare variants with unknown functional significance , underscoring the need for predictive computational analysis to determine their biological impact. This is especially true for variants which occur in proteins where the effectiveness of targeted therapeutic strategies may be disrupted. For example, missense mutations that emerge in response to evolutionary pressures in a growing tumor to disrupt binding of targeted inhibitor molecules . SNP2SIM enables the profiling of multiple approved inhibitors to inform the selection or design of an optimal therapy that maintains a positive clinical response .
By simulating the variant specific contributions to the overall protein conformational dynamics and ligand binding, the unique impact of a variant can be quantified even when the mutated residues do not occur at the interaction interface. This offers an advantage over using the crystal structure as the basis for small molecule docking simulations, instead providing a set of structures that is specific to the impact of the given variant. This is significant, as MD can capture conformational states not represented in crystal structures . Even for the wildtype structure, two populated conformations were identified which show slightly modified geometries of the protein backbone found in the crystal structure.
The SNP2SIM workflow represents a higher resolution approach to in silico ligand binding. Instead of using a single structure derived from crystallography experiments, a set of variant specific scaffolds are used to predict the binging affinity to small molecule ligands. The additional information on protein dynamics will ultimately produce more robust analysis and improve predictive models used for downstream drug development, design, and utilization. While the current iteration of SNP2SIM only manages the execution of the simulation workflow, predicative models can be built that integrate the data on the population (Fig. 2), structural divergence (Fig. 3), and binding interactions (Fig. 4).
The utility of a predictive, simulation based model, and the insight it can provide to understanding the functional changes of protein sequence variants, is rate-limited by computational costs and scale of potential variation. PD-L1 was chosen because it presented an optimal development case, where the size and structural stability helped to minimize the computational time required by the MD simulations. When simulated larger domains, such as folds that result in the ATP binding pocket in protein kinases, the computational requirements to generate relevant simulation timescales can grow to become prohibitive. These barriers are being overcome through access to cheap cloud computing and the development of reproducible workflows that can integrate standardized results from multiple research groups. And while a lot has been done to lower the barrier for novice users to access these tools through widely available infrastructure such as the NCI cloud pilots, creating an easy-to-use simulation and analysis workflow opens the doors to many researchers who would otherwise not have access. SNP2SIM ensures a uniform generation of input files, application of simulation parameters, and quantification of the results, and enables the parallel implementation of molecular simulations across hardware infrastructure.
Availability and requirements
Project name: SNP2SIM.
Project home page: https://github.com/mccoymd/SNP2SIM
Operating system: Linux.
Programming language: Python.
Other requirements: Nanoscale Molecular Dynamics (NAMD), Visual Molecular Dynamics (VMD), AutoDock Vina, AutoDock Tools.
Any restrictions to use by non-academics: Yes, subject to license and usage agreements for simulation software packages.
This research benefited from the use of credits from the National Institutes of Health (NIH) Cloud Credits Model Pilot, a component of the NIH Big Data to Knowledge (BD2K) program.
NIH/NHGRI 3U41HG007822-02S1, NIH Cloud Credits, NIH/NHLBI R01HL105239. The funding bodies did not contribute to the design of the study, the collection, analysis and interpretation of the data, nor the writing of the manuscript.
Availability of data and materials
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.
MM and MJ developed the underlying methods using simulations for variant analysis, while MM and VS developed the code for the simulation. SN provided the motiving use case and data for development and testing, and SM advised on the overall development and translational application. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
- 13.SNP2SIM Repository [https://github.com/mccoymd/SNP2SIM; Accessed 2019 February].
- 15.Cancer Genomics Cloud [www.cancergenomicscloud.org; Accessed 2018 November].
- 16.Huang J, MacKerell AD, Jr. CHARMM36 all-atom additive protein force field: validation based on comparison to NMR data. J Comput Chem 2013;34(25):2135–2145.Google Scholar
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.