Structure-based prediction of protein-nucleic acid binding using graph neural networks

Sagendorf, Jared M.; Mitra, Raktim; Huang, Jiawei; Chen, Xiaojiang S.; Rohs, Remo

doi:10.1007/s12551-024-01201-w

Review
Open access
Published: 26 June 2024

Biophysical Reviews

Abstract

Protein-nucleic acid (PNA) binding plays critical roles in the transcription, translation, regulation, and three-dimensional organization of the genome. Structural models of proteins bound to nucleic acids (NA) provide insights into the chemical, electrostatic, and geometric properties of the protein structure that give rise to NA binding but are scarce relative to models of unbound proteins. We developed a deep learning approach for predicting PNA binding given the unbound structure of a protein that we call PNAbind. Our method utilizes graph neural networks to encode the spatial distribution of physicochemical and geometric properties of protein structures that are predictive of NA binding. Using global physicochemical encodings, our models predict the overall binding function of a protein, and using local encodings, they predict the location of individual NA binding residues. Our models can discriminate between specificity for DNA or RNA binding, and we show that predictions made on computationally derived protein structures can be used to gain mechanistic understanding of chemical and structural features that determine NA recognition. Binding site predictions were validated against benchmark datasets, achieving AUROC scores in the range of 0.92–0.95. We applied our models to the HIV-1 restriction factor APOBEC3G and showed that our model predictions are consistent with and help explain experimental RNA binding data.

Introduction

The ability of nucleic acid-binding proteins (NABP) to recognize and bind their targets is determined by the spatial arrangement of solvent-exposed side chains and corresponding geometric, chemical, and electrostatic properties of the protein structure. Structural models of protein-nucleic acid (PNA) complexes offer insights into physical mechanisms that underlie PNA recognition but are difficult to obtain experimentally, and accurate de novo structural modeling methods are limited. Currently, approximately 49,000 protein structures are available in the Protein Data Bank (PDB) (Berman et al. 2000) with annotated nucleic acid (NA) binding function, but only about 23% of such entries contain complexes of protein bound to NA.

In addition to experimentally determined structures, advances in protein structure prediction (Baek et al. 2021; Jumper et al. 2021; Ahdritz et al. 2024) offer vast amounts of accurately predicted protein structures that can be analyzed with computational methods. Knowledge of a protein’s binding function or which amino acid side chains participate in NA binding is valuable for target selection and drug design, can aid interpretation of biochemical data, or can be used as a source of prior information to improve modeling of PNA complexes. Very recently, generative methods such as AlphaFold3 (Abramson et al. 2024) and RoseTTAFoldNA (Baek et al. 2024) have been developed which can predict PNA complexes. Such methods can be used to directly infer NA binding sites and interactions; however, the accuracy of such inference depends on predicting both the protein structure and NA structure with high accuracy. Current methods report that accuracy is limited especially for protein-RNA prediction, and analysis has shown that such methods may have high false-positive rates (Schmid and Walter 2024), generating complexes with plausible stereochemistry but which do not form ex silico. Further, due to their complexity, such models are still of limited interpretability with respect to mechanistic understanding of PNA interactions. Therefore, at the present time, interpretable methods for predicting NA binding function and binding sites without the need for models of the PNA complex may be necessary and preferable.

Early analysis of experimentally determined structures of PNA complexes showed that certain physical characteristics of binding interfaces are common among NABP, such as enrichment of polar and positively charged sidechains (Jones et al. 1999; Nadassy et al. 1999; Draper 1999), highly positive electrostatic potential, and the hydrogen bonding proclivity of certain residue-nucleotide pairs (Mandel-Gutfreund et al. 1995). The observation that PNA binding sites contain physicochemical signatures with predictive power led to the development of many computational approaches for predicting NA binding from both protein sequence and structure. Several reviews have been published, and comparison studies have been performed which summarize much of this earlier work (Walia et al. 2012; Wang et al. 2020; Yan et al. 2015; Miao and Westhof 2015; Si et al. 2015).

Recently, structure-based approaches utilizing graph neural networks (GNN) (Zhou et al. 2018; Wu et al. 2020) have been developed for predicting protein function and binding (Gligorijevic et al. 2021; Gainza et al. 2020; Yuan et al. 2022; Xia et al. 2021; Tubiana et al. 2022; Krapp et al. 2023; Li and Liu 2023; Sverrisson et al. 2021), including NA binding. DeepFRI (Gligorijevic et al. 2021) is a method for predicting protein function from structure that constructs a graph representation of a protein based on C_α–C_α atom distances and utilizes a protein-sequence embedding model to generate features for each residue node in the graph. GraphSite (Yuan et al. 2022) utilizes a similar approach for predicting DNA binding sites but uses AlphaFold2 (Jumper et al. 2021) for its sequence embedding. GraphBind (Xia et al. 2021) is a GNN-based method that represents protein residues as nodes and constructs a dense graph within a sliding sphere of a fixed radius centered on each node. Sequence and structure features are assigned to each node, and residue-level DNA and RNA binding site predictions are then learned based on a latent spatial encoding. ScanNet (Tubiana et al. 2022) uses a point-cloud representation of a protein and learns an embedding of the spatiochemical arrangement of neighboring atoms and residues to predict protein-protein binding interfaces. PeSTo (Krapp et al. 2023) is a method for predicting protein-protein interfaces using a geometric transformer and only considers atomic coordinates and element type as input information. Their method has been applied to protein-ligand, PNA, protein-ion, and protein-lipid binding. GeoBind (Li and Liu 2023) utilizes quasi-geodesic convolutions over point clouds for DNA and RNA binding site prediction with features closely based on dMaSIF (Sverrisson et al. 2021). Other recent methods that do not utilize GNN but are closely related include MaSIF (Gainza et al. 2020), which predicts protein-protein binding interfaces and small-ligand binding sites, and PST-PRNA (Li and Liu 2022), which predicts RNA binding sites. Finally, EquiPNAS (Roche et al. 2024) is a method that combines protein language models (pLMs) with an equivariant neural network architecture to predict PNA binding sites. It incorporates protein sequences, structural information, and information from a pre-trained protein language model.

We present PNAbind, a GNN-based method for predicting DNA and RNA binding function and binding sites from protein structure. In our approach, a protein structure is represented using a mesh discretization of the solvent-excluded molecular surface. The molecular surface is a convenient representation because it accurately captures important geometrical characteristics of a protein that are relevant for binding (e.g., binding pockets and shape complementarity of the binding site with a binding target) in a more efficient way compared to voxel or density-based representations. Geometric, electrostatic, evolutionary, and chemical properties of the protein structure are mapped onto the surface and included as features. Our GNN models learn encodings of the spatial arrangement of this information, which are then used for NA binding prediction. We utilize two variants of our models, one for each of two kinds of predictions: local predictions at the level of individual residues for identifying regions of a protein surface that constitute NA binding sites, and global predictions at the level of an entire protein complex for the overall binding function of a protein assembly.

We show that our approach is applicable to proteins with a wide diversity in structural fold, size, and biological function. Our models predict DNA and RNA binding with high accuracy and can discriminate DNA versus RNA binding, which we show is achieved via distinct structural features related to binding mechanisms. Our models for binding site prediction were validated against benchmark datasets and compared with similar recent methods, achieving improvements on several metrics. We show that our method generalizes to both bound and unbound protein structures and produces very low false positive rates on a negative control dataset. Finally, we demonstrate that our model predictions are consistent with experimental RNA binding data for the deoxycytidine deaminase APOBEC3G which plays important roles in the restriction of the HIV-1 virus.

Methods

Mesh generation

We used NanoShaper (Decherchi and Rocchia 2013) for generating the molecular surface mesh and found that the “skin” surface (Edelsbrunner 1999) worked well in our study. The smoothness of the skin surface is controlled by a scalar “shrink factor,” which we set to 0.45, producing a surface that is smooth but captures geometrical features such as binding pockets. We use van der Waals parameters from the AMBER ff99 force field (Wang et al. 2000). Prior to mesh generation, missing atoms are added to the structure using PDB2PQR (Dolinsky et al. 2004). The AMBER99 parameter set defines a radius of 0 for some hydrogen atoms—in these cases, we set the radius to a minimum value of 0.6 Å.

Vertex features

A list of vertex features used in our study is given in Supplementary Table 1. Different features are computed at different resolutions before being mapped onto the surface mesh. For example, multiple-sequence alignment (MSA) features are computed at the level of individual residues, hydrogen bond donor/acceptor labels are assigned to individual atoms, and the surface mean curvature is computed for mesh vertices. Residue-level features are first mapped to child atoms, and atom-level features are mapped to vertices using a distance-weighted average, with weights given by the inverse Euclidean distance between the atom and each nearby vertex, for a maximum distance cutoff of 2.5 Å.
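The atom-to-vertex mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the PNAbind implementation: the function name, the small `eps` added to avoid division by zero, and the choice to assign zeros to vertices with no atom inside the cutoff are all assumptions made for the example.

```python
import numpy as np

def map_atom_features_to_vertices(atom_xyz, atom_feat, vert_xyz, cutoff=2.5, eps=1e-8):
    """Inverse-distance weighted average of atom features at each mesh vertex.

    atom_xyz: (n_atoms, 3), atom_feat: (n_atoms, n_feat), vert_xyz: (n_verts, 3).
    Hypothetical helper: names and the zero-fill convention are illustrative.
    """
    # Pairwise vertex-atom distances, shape (n_verts, n_atoms)
    d = np.linalg.norm(vert_xyz[:, None, :] - atom_xyz[None, :, :], axis=-1)
    # Inverse-distance weights, zeroed beyond the distance cutoff
    w = np.where(d <= cutoff, 1.0 / (d + eps), 0.0)
    wsum = w.sum(axis=1, keepdims=True)
    # Normalize weights per vertex; vertices with no atom in range keep zero weights
    w_norm = np.divide(w, wsum, out=np.zeros_like(w), where=wsum > 0)
    return w_norm @ atom_feat

atoms = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
feats = np.array([[1.0], [3.0]])
verts = np.array([[1.0, 0.0, 0.0], [0.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
vertex_feats = map_atom_features_to_vertices(atoms, feats, verts)
```

A vertex equidistant from both atoms receives their plain average, while a vertex closer to one atom is pulled toward that atom's feature value.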

PH1 and PH2 refer to the electrostatic potential and the normal component of the electric field, respectively. The electrostatic potential is computed using TABI-PB (Geng and Krasny 2013), a boundary integral Poisson–Boltzmann solver. We set the bulk ion concentration to 0.15 M and the interior dielectric constant to 2.0. AF1–5 are numerical descriptors known as Atchley factors that describe a variety of physicochemical attributes such as polarity, secondary structure propensity, molecular volume, codon diversity, and electrostatic charge of amino acids. SAP is short for spatial aggregation propensity and measures hydrophobicity. HBA and HBD indicate proximity to a hydrogen bond acceptor or donor. CV1–3 are geometric features known as circular variance that roughly correspond to the solid angle of a sphere (centered on each vertex) that is occupied by the molecular volume of the protein. For points that protrude from the surface, most of the sphere is unoccupied and the feature takes a value near zero. For points that are buried deep within a pocket or channel, almost all the volume of the sphere is occupied, and the feature takes a value near 1. CV1–3 are defined with respect to spheres of radius 7.5, 15.0, and 30.0 Å, respectively. MC stands for mean curvature, a standard geometric measure of curvature that is positive for convex surfaces and negative for concave surfaces. HK1–4 are heat kernel signatures (HKS), widely used shape descriptors defined in terms of eigenvectors and eigenvalues of the Laplace–Beltrami operator. They are mathematically related to the diffusion of heat over a surface and in practice are good descriptors of shape that are robust to small perturbations to the isometry of a surface. Each HKS feature encodes geometric information over a different distance scale, as shown in Supplementary Fig. 1. PSSM1–20 are features derived from position-specific scoring matrices (PSSMs) computed by PSI-BLAST. Sequences of target chains were searched against the UniRef90 database with the following parameter values: E-limit of 10, E-value cutoff of 0.001, gap open penalty of 1, gap extend penalty of 1, and 3 iterations. HMM1–30 are features derived from profile HMMs computed by HHblits. Sequences of target chains were searched against the UniRef30 database using default run parameters.
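To make the circular variance feature more concrete, the sketch below uses a common definition: one minus the norm of the mean unit vector pointing from the query point to each atom inside the sphere. Whether PNAbind uses exactly this formula is an assumption; the function name and the convention of returning 0 for an empty sphere are illustrative only.

```python
import numpy as np

def circular_variance(point, atom_xyz, radius):
    """Circular variance of atom directions around `point` within `radius`.

    Assumed definition: CV = 1 - |sum of unit vectors| / n.
    Near 0 when all atoms lie on one side (protruding point),
    near 1 when atoms surround the point (buried point).
    """
    r = atom_xyz - point
    d = np.linalg.norm(r, axis=1)
    mask = (d > 0) & (d <= radius)
    if not mask.any():
        return 0.0  # illustrative convention for an empty neighborhood
    unit = r[mask] / d[mask, None]
    return 1.0 - np.linalg.norm(unit.sum(axis=0)) / mask.sum()

origin = np.zeros(3)
# All atoms in the same direction: the point behaves like a protrusion
one_sided = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
# Atoms placed symmetrically around the point: the point behaves like a buried site
surrounded = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                       [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)
cv_exposed = circular_variance(origin, one_sided, radius=7.5)
cv_buried = circular_variance(origin, surrounded, radius=7.5)
```

Evaluating the same function with radii of 7.5, 15.0, and 30.0 Å would give three features that probe burial at increasing distance scales.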

Edge features

We use the same edge features described by Deng et al. (2018), which include the edge length (vertex-vertex distance), the angle an edge makes with the surface normal at each adjacent vertex, and the angle between the surface normal at adjacent vertices. These features encode rotationally invariant geometric information about how vertices within a radial neighborhood are distributed in space. For example, if all vertices lie in a plane, the normal-normal angles will be zero, and all edge-normal angles will be 90°. However, if the vertices lie on a sphere, then the normal-normal and edge-normal angles will vary as a function of distance in a systematic way.
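The four edge features can be illustrated with a short NumPy sketch. It assumes unsigned angles computed with `arccos` and a hypothetical array layout; the exact conventions of Deng et al. (2018) and of the PNAbind code may differ.

```python
import numpy as np

def angle(u, v):
    """Unsigned angle (radians) between corresponding rows of u and v."""
    cos = np.sum(u * v, axis=-1) / (np.linalg.norm(u, axis=-1) * np.linalg.norm(v, axis=-1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def edge_features(xyz, normals, edges):
    """Per-edge features: length, two edge-normal angles, normal-normal angle.

    xyz: (n, 3) vertex coordinates, normals: (n, 3) surface normals,
    edges: (m, 2) integer index pairs (i, j). Illustrative layout only.
    """
    i, j = edges[:, 0], edges[:, 1]
    e = xyz[j] - xyz[i]
    return np.stack([
        np.linalg.norm(e, axis=-1),      # edge length
        angle(e, normals[i]),            # edge vs. normal at vertex i
        angle(e, normals[j]),            # edge vs. normal at vertex j
        angle(normals[i], normals[j]),   # normal vs. normal
    ], axis=1)

# Planar example from the text: two vertices in the xy-plane with +z normals
xyz = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
edges = np.array([[0, 1]])
planar = edge_features(xyz, normals, edges)

# Rotating coordinates and normals together leaves the features unchanged
c, s = np.cos(0.7), np.sin(0.7)
R = np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])
rotated = edge_features(xyz @ R.T, normals @ R.T, edges)
```

For the planar case the edge-normal angles are 90° and the normal-normal angle is zero, matching the example given in the text.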

Probabilistic binding model

Given a mesh representation of a protein molecular surface, $G=\{\mathbf{V}\in {\mathbb{R}}^{|V|\times 3},\mathbf{E}\in {\mathbb{N}}^{|E|\times 2}\}$ where $\mathbf{V}$ are the vertex coordinates in Euclidean space, $\mathbf{E}$ are pairs of indices into $\mathbf{V}$ that form directed edges corresponding to triangle faces, ${\mathbf{X}}_{V}$ are vertex features and ${\mathbf{X}}_{E}$ are edge features, we wish to either classify the entire graph based on the global aggregation of local physicochemical properties or to classify every vertex ${v}_{i}\in \mathbf{V}$ as a binding site or non-binding site based on features of that vertex and features of its local environment. Both approaches share similar frameworks, and we first describe the framework for vertex classification (i.e., segmentation).

Segmentation

We compute a per-vertex Bernoulli distribution $q({y}_{i}|G,{\mathbf{X}}_{V}, {\mathbf{X}}_{E};{\varvec{\uptheta}})\in [\text{0,1}]$ where ${y}_{i}$ is the class label of ${v}_{i}$; ${y}_{i}=1$ represents a binding site and ${y}_{i}=0$ a non-binding site. The distribution $q$ depends on a parameterization ${\varvec{\uptheta}}$ defined by the neural network. We note that ${\varvec{\uptheta}}$ does not depend on $i$—therefore the parameters are shared over all vertices in $G$.

Let $\mathcal{D}=\{{G}_{n},{\mathbf{p}}_{n}\}$ be a dataset of protein molecular surface graphs, where ${\mathbf{p}}_{n}$ are the corresponding per-vertex binding probabilities, ${p}_{ni}\in \{\text{0,1}\}$. ${p}_{ni}=1$ if the $i$ th vertex lies in an observed NA binding site and is 0 otherwise. We learn $q$ by minimizing the cross-entropy of our predicted probabilities $\mathbf{q}$ with the observed probabilities ${\mathbf{p}}_{n}$ over all graph vertices. That is, we seek to find a set of parameters that satisfies the following condition,

$${{\varvec{\uptheta}}}^{*}=\underset{{\varvec{\uptheta}}}{\text{argmin}}\left[-\sum_{G\in \mathcal{D}}\sum_{i\in G}\Big(p({y}_{i}=1)\text{log }q({y}_{i}=1;{\varvec{\uptheta}})+p({y}_{i}=0)\text{log }q({y}_{i}=0;{\varvec{\uptheta}})\Big)\right]$$

Given an estimate for ${{\varvec{\uptheta}}}^{*}$, we can predict the probability for every vertex in $G$ to belong to a binding site or not, and assign labels to each vertex by choosing a threshold on ${q}_{i}$,

$${\widehat{y}}_{i}=\left\{\begin{array}{ll}1& \text{if }q({y}_{i}=1|G,{\mathbf{X}}_{V},{\mathbf{X}}_{E};{{\varvec{\uptheta}}}^{*})\ge {\tau }_{p}\\ 0& {\text{else}}\end{array}\right.$$

where ${\widehat{y}}_{i}$ is the predicted label for the $i$ th vertex and ${\tau }_{p}\in [\text{0,1}]$ is a threshold value that can be chosen depending on the application. In practice ${{\varvec{\uptheta}}}^{*}$ is estimated via gradient descent.
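As a numerical illustration of the segmentation objective and the thresholding rule, the following sketch evaluates the cross-entropy for one graph and assigns labels. The function names and the clipping constant `eps` are illustrative assumptions; in practice the loss would be minimized with an automatic-differentiation framework rather than evaluated in NumPy.

```python
import numpy as np

def cross_entropy(p, q, eps=1e-12):
    """Summed binary cross-entropy between observed labels p and predictions q.

    p[i] is 1 for a binding-site vertex and 0 otherwise; q[i] is the predicted
    probability q(y_i = 1). Clipping avoids log(0) and is an implementation detail.
    """
    q = np.clip(q, eps, 1.0 - eps)
    return -np.sum(p * np.log(q) + (1.0 - p) * np.log(1.0 - q))

def assign_labels(q, tau=0.5):
    """Predicted label is 1 where q(y_i = 1) >= tau, else 0."""
    return (q >= tau).astype(int)

p = np.array([1.0, 0.0])
q = np.array([0.5, 0.5])
loss = cross_entropy(p, q)          # two maximally uncertain vertices
labels = assign_labels(np.array([0.2, 0.5, 0.9]), tau=0.5)
```

Lowering `tau` trades precision for recall, which is why the threshold is treated as an application-dependent choice.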

Graph classification

To assign a probability to the entire graph, we modify our learned distribution to be of the form $q=q(y|G,{\mathbf{X}}_{V},{\mathbf{X}}_{E};{\varvec{\uptheta}})$ where $y$ is now a single binary class label for the entire graph, and our dataset is of the form $\mathcal{D}=\{{G}_{n},{p}_{n}\}$ where ${p}_{n}\in \{\text{0,1}\}$. The model parameters can be optimized in the same way described above by minimizing a cross-entropy loss function using gradient descent.

Probability thresholds

Assigning a class given a probability depends on choosing a threshold which is in practice somewhat arbitrary. For segmentation, due to the imbalanced nature of the datasets (statistics in Supplementary Fig. 4a), an unbiased prior of 0.5 may be a sub-optimal choice. We applied Platt scaling to each model (using validation folds for scaling parameter optimization) before combining probabilities in the ensemble and chose a probability threshold that maximized the F1 score of training set predictions, such that each ensemble model has an independent threshold assigned to it. We did not apply Platt scaling to binding function prediction models and chose a threshold of 0.5 for those models, as these datasets are well balanced.
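The threshold selection step can be sketched as a simple grid search for the probability cutoff that maximizes F1. This example covers only the threshold search, not Platt scaling; the grid resolution and function names are assumptions made for illustration.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN); returns 0 when there are no positives at all."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def best_f1_threshold(y_true, prob, grid=None):
    """Return the (threshold, F1) pair with the highest F1 on a grid of thresholds."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # assumed resolution
    scores = [f1_score(y_true, (prob >= t).astype(int)) for t in grid]
    k = int(np.argmax(scores))
    return float(grid[k]), float(scores[k])

# Imbalanced toy example: the best threshold falls below 0.5
y = np.array([0, 0, 0, 1, 1])
prob = np.array([0.10, 0.20, 0.30, 0.35, 0.90])
tau, score = best_f1_threshold(y, prob)
```

In this toy case a threshold of 0.5 would miss the positive scored at 0.35, whereas the F1-optimal threshold recovers both positives without false positives.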

Segmentation models produce vertex-level binding site probabilities. Residue-level probabilities are obtained by max pooling over all vertices that correspond to the solvent-excluded surface of each residue.
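A minimal sketch of the vertex-to-residue max pooling follows. It assumes each vertex carries the integer index of the residue it belongs to; returning NaN for residues without surface vertices (for example, fully buried residues) is a convention chosen for this example, not one stated in the text.

```python
import numpy as np

def residue_probabilities(vertex_prob, vertex_residue, n_residues):
    """Max-pool vertex probabilities into one probability per residue.

    vertex_prob: (n_verts,) predicted binding probabilities.
    vertex_residue: (n_verts,) integer residue index of each vertex (assumed input).
    """
    res = np.full(n_residues, -np.inf)
    # Unbuffered in-place maximum handles repeated residue indices correctly
    np.maximum.at(res, vertex_residue, vertex_prob)
    res[np.isinf(res)] = np.nan  # residues with no surface vertices
    return res

vertex_prob = np.array([0.1, 0.8, 0.3, 0.4])
vertex_residue = np.array([0, 0, 1, 1])
res_prob = residue_probabilities(vertex_prob, vertex_residue, n_residues=3)
```

With max pooling, a residue is flagged as binding if any part of its exposed surface is predicted to bind.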

GNN model architecture

Segmentation

We utilize a convolutional GNN to model per-vertex binding site probabilities $\mathbf{q}={\text{GNN}}(\mathbf{E},\mathbf{V},{\mathbf{X}}_{V}, {\mathbf{X}}_{E};{\varvec{\uptheta}})\in {[0,1]}^{|V|\times 2}$. The network is based on a graph U-Net (Gao and Ji 2019) style architecture, combined with the iterative farthest point sampling (FPS) pooling and k-nearest neighbors (KNN) unpooling used in PointNet++ (Qi et al. 2017). The crystal-graph (CG) convolution introduced by Xie and Grossman (2018) was chosen as the convolution kernel for all convolutional layers. The network architecture is shown in Fig. 2b. A schematic overview of how convolutional, pooling and unpooling layers work is provided in Fig. 2c. First, a uniformly distributed subset of vertices is sampled via FPS, with a sampling ratio of $\alpha =0.5$. Centered on each sampled vertex (referred to as a centroid), a directed radial graph is constructed by joining each centroid to all other vertices within a sphere of radius $r$. A message passing convolution operation is then performed over this radial graph. The incoming messages for every centroid are aggregated via a linear combination of min, max, mean, and standard deviation pooling, and the final activation for each centroid is given by the sum of the pooled messages and the current centroid activation. Updated centroid activations contain information about the spatial context of each centroid within the neighborhood defined by the radius $r$. Convolutions are repeated twice per layer, after which non-centroid vertices are removed from the graph and the centroids form a new, sparser set of vertices. The sampling/convolution/pooling operation is then repeated $d=3$ times. We choose increasing radii for each convolutional layer in the network with $r=5.0,\;7.5,\;10.0$ Å.

To perform segmentation, pooled vertices must be interpolated back to restore the original resolution of the mesh, which is accomplished by KNN unpooling (Qi et al. 2017). Starting from the final coarsened layer ${\ell}={{\ell}}_{d}$, we interpolate features of pooled non-centroid vertices ${x}_{j}^{{{\ell}}_{d}+1}$ from the $k=3$ nearest centroid vertices as shown in Fig. 2d. The interpolated features are a distance-weighted mean of centroid features. Finally, these interpolated features are combined with the matching vertex activations from the corresponding convolution/pooling layer via skip connections as shown in Fig. 2b. The skip connections improve gradient flow through the network and allow spatial context information from earlier layers to propagate forward more efficiently. The output of the final unpooling layer is concatenated with the full-resolution skip connections and passed through a two-layer MLP which acts like a 1 × 1 convolution to transform the activations to ${\mathbb{R}}^{2}$. After softmax normalization, the output of the model represents a Bernoulli distribution over each vertex.
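The pooling and unpooling steps can be sketched with plain NumPy. This is an illustration of farthest point sampling and KNN interpolation in general, not the PNAbind code: the function names are hypothetical, the starting vertex is fixed for reproducibility, and inverse-distance weights are assumed (PointNet++ uses inverse squared distances; the text only states that the mean is distance-weighted).

```python
import numpy as np

def farthest_point_sampling(xyz, ratio=0.5, start=0):
    """Iteratively pick the vertex farthest from all vertices chosen so far."""
    n = len(xyz)
    m = max(1, int(np.ceil(ratio * n)))
    chosen = [start]
    # Distance from every vertex to its nearest chosen centroid
    dist = np.linalg.norm(xyz - xyz[start], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def knn_unpool(centroid_xyz, centroid_feat, target_xyz, k=3, eps=1e-8):
    """Interpolate centroid features onto target vertices from the k nearest centroids."""
    d = np.linalg.norm(target_xyz[:, None, :] - centroid_xyz[None, :, :], axis=-1)
    k = min(k, centroid_xyz.shape[0])
    idx = np.argsort(d, axis=1)[:, :k]             # indices of k nearest centroids
    dk = np.take_along_axis(d, idx, axis=1)
    w = 1.0 / (dk + eps)                           # assumed inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkf->nf", w, centroid_feat[idx])

# Four vertices on a line; the feature is simply the x coordinate
xyz = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0], [3.0, 0, 0]])
feat = xyz[:, :1].copy()
centroids = farthest_point_sampling(xyz, ratio=0.5)
restored = knn_unpool(xyz[centroids], feat[centroids], xyz, k=3)
```

Sampling keeps the two end points, and unpooling then reconstructs a full-resolution feature for every vertex, including those removed during pooling.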

Classification

The architecture of the classification model is identical to the segmentation model up to the final convolutional/pooling layer and is shown in Fig. 2a. In place of unpooling layers, a single global pooling layer is used to aggregate all information over the coarsened mesh to a fixed-dimensional activation. The global attention pooling is of the form

$$\mathbf{r}=\sum_{i\in V} \sigma \left(f\left({\mathbf{x}}_{i}\right)\right){\mathbf{x}}_{i}$$

where $f\left(\cdot \right)$ is an MLP with learnable weights and $\sigma \left(\cdot \right)$ is the sigmoid function.
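The pooling formula can be written out directly. In this sketch the gate network $f$ is a hypothetical two-layer MLP with a scalar output per vertex; the layer sizes, the ReLU nonlinearity, and the scalar (rather than per-channel) gate are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_gate(x, W1, b1, W2, b2):
    """Hypothetical two-layer MLP f(x) producing one gate logit per vertex."""
    h = np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2                # shape (n_vertices, 1)

def global_attention_pool(x, params):
    """r = sum_i sigmoid(f(x_i)) * x_i, a fixed-size summary of all vertices."""
    gate = sigmoid(mlp_gate(x, *params))
    return (gate * x).sum(axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                       # 5 vertices, 4 features each
params = (rng.normal(size=(4, 8)), np.zeros(8),
          rng.normal(size=(8, 1)), np.zeros(1))
r = global_attention_pool(x, params)
r_permuted = global_attention_pool(x[::-1], params)

# With all-zero weights every gate equals sigmoid(0) = 0.5
zero_params = (np.zeros((4, 8)), np.zeros(8), np.zeros((8, 1)), np.zeros(1))
r_zero = global_attention_pool(x, zero_params)
```

Because the result is a sum over vertices, its size does not depend on the number of vertices and it is unchanged when the vertices are reordered.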

Datasets

Classification

We developed three datasets for training models to predict DNA and RNA binding function using structures based on AlphaFold2 predictions. Three sets of proteins were identified in the Swiss-Prot knowledgebase (Bateman et al. 2021) using functional annotations: DNA binding proteins (DBP), RNA binding proteins (RBP), and non-binding proteins (nBP). DNA/RNA binding proteins were first identified by searching the database for sequence entries with the UniProtKB molecular function keyword “DNA-binding” or “RNA-binding.” Any protein sequence which contained both annotations was excluded, and sequences longer than 1000 amino acids or with an annotation score less than 2 were excluded. To identify nBP, sequence entries which did not contain the keyword “DNA-binding” or “RNA-binding” were first identified. Entries which contained additional keywords related to NA binding, such as “Nucleotide-binding,” “Repressor,” “Activator,” and “Ribonucleoprotein,” were then excluded. Additionally, Gene Ontology (GO) (Ashburner et al. 2000) molecular function annotations were used for excluding entries that contained annotations related to NA binding. A list of more than 4400 GO terms was generated by traversing the GO annotation hierarchy starting from source terms such as “nucleotide binding,” “nucleic acid binding,” “nucleotide metabolic process,” and “nucleic acid metabolic process.” After filtering, AlphaFold2-derived structures (Jumper et al. 2021) were obtained for the remaining sequences, and a second round of filtering was performed based on the predicted structure quality. Low-confidence regions of the predicted structures were removed (confidence < 0.65), and only structures which remained a single connected (non-disjoint) structure were kept. Finally, sequence clustering was performed on the remaining sequences at a threshold of 35% sequence similarity, and two sequences per cluster were sampled. The result was three sets of proteins for which we have high confidence in both the predicted structure quality and the accuracy of the function annotations. These were then used for the DnP-6784, RnP-6046, and RDP-6046 datasets. Supplementary Fig. 4 shows statistics regarding these datasets.

Segmentation

Seven benchmark structural datasets of PNA complexes were used in this study for training/testing segmentation models. One DBP dataset (DNA-573 (Xia et al. 2021)) and two RBP datasets (RBP09 (Li and Liu 2022), RNA-495 (Xia et al. 2021)) were used as training sets, and two DBP datasets (DNA-181 (Yuan et al. 2022), DNA-129 (Xia et al. 2021)) and two RBP datasets (RBP61 (Li and Liu 2022), RNA-117 (Xia et al. 2021)) were used as independent test sets. These datasets contain experimentally derived structural models of PNA complexes deposited in the PDB (Berman et al. 2003). Models trained on RBP09 were validated on RBP61, models trained on DNA-573 were validated on DNA-129 and DNA-181, and models trained on RNA-495 were validated on RNA-117. When using the biological assembly as the basic structural unit, any additional chains besides the target chain were masked out during training/testing for fair comparisons with previous studies. Supplementary Fig. 4 shows statistics regarding the class distribution and functional annotations of the proteins in these datasets. A negative control dataset was produced by taking the subset of proteins in the nBP dataset described above for which experimentally determined structures were available.

Model training procedure

Model parameters were determined using the Adam optimizer (Kingma and Ba 2014) to minimize the cross-entropy between predicted probabilities and the ground-truth labels over the mesh vertices (segmentation) or entire mesh (classification). Five-fold cross-validation with early stopping was performed. For each fold, several duplicate models were trained with random initialization, and within each fold, the best model was selected.

Results

In the following sections, we first summarize the main design concepts and elements used in PNAbind. Next, we report on DNA and RNA binding function prediction using models trained on AlphaFold2-predicted protein structures. Our models predict DNA and RNA binding with AUROC values of 0.94 and 0.95, respectively, and discriminate DNA versus RNA binding with an AUROC value of 0.92. Applying attribution methods, we use these models to interpret structural features and regions of the protein structure responsible for binding function, and we highlight chemical and geometric differences that are important for discriminating DNA versus RNA binding. We then show results on NA binding site prediction (residue level predictions) using models trained on native protein structures. Our models achieve high accuracy across a variety of benchmark datasets with AUROC scores ranging from 0.916 to 0.953. We compare binding site predictions on native bound protein structures versus predicted unbound structures and show high correlations of model predictions. Finally, we apply our model to the deoxycytidine deaminase APOBEC3G and show that our predictions are consistent with what is known experimentally about the RNA binding properties of this protein.

PNAbind overview

PNAbind makes binding predictions based on the geometric, chemical, and electrostatic properties of protein molecular surfaces. Our models are based on GNN, a versatile neural network architecture designed to operate on graph domains (Zhou et al. 2018; Wu et al. 2020), and make inferences using a mesh discretization of the molecular surface. Our models predict the overall binding function (e.g., DNA-binding or RNA-binding) of a protein or protein assembly via graph classification and predict the location of NA binding sites on the protein surface via graph segmentation. By design, GNN automatically respect the permutation symmetry of graphs, and by choosing vertex and edge features that are invariant under translation and rotation, our models are completely invariant to all isometries of Euclidean space. An overview of our method is shown in Fig. 1, and further details of the network architectures are provided in Fig. 2.

The primary building blocks of our models are graph convolution, pooling and unpooling layers, followed by multilayer perceptron (MLP) modules (Methods). Graph convolutions are learned functions applied over the edges of a graph which aggregate information about the local graph structure. A visual example of graph convolutions we use is shown in Fig. 1c. The convolutions are performed over edges defined by a radial distance threshold, and the convolution kernel depends on edge features, which describe distance and angle relationships (Fig. 3a) and features of adjacent vertices. By aggregating the activations of the kernel function over the edges of vertex neighborhoods and applying the convolution repeatedly, spatial arrangements of physicochemical and geometric information that are indicative of NA binding can be encoded. Pooling is used to coarsen the mesh by progressively removing select vertices. It allows for increasing the radius of the next layer of convolutions without incurring a roughly cubic increase in the number of edges in the resulting graph. By increasing the radius, the model can learn encodings over different distance scales. Unpooling is only needed for binding site prediction and it serves to restore the mesh to its original resolution. The final global (local) encodings are then transformed via an MLP module, and the output of the networks represents probability distributions that are computed per-mesh (vertex) for binding function (site) prediction. The runtime of PNAbind is fast, as shown in Supplementary Fig. 5. We find that the median wall time to evaluate our segmentation model (inference) is less than one second per protein surface mesh.

Surface mesh features

We use a combination of features defined on graph edges and vertices that are derived from the atomic resolution protein structure, surface mesh, and sequence in our models. Edge features are used to describe geometric relationships over convolution neighborhoods, such as distance and relative orientation of surface normal vectors (Fig. 3a, Methods). Vertex features, which we categorize into four groups, encode biophysical properties and evolutionary information that is mapped onto the protein surface. Chemical features describe hydrogen bond donors and acceptors, hydrophobicity, and physical properties of exposed side chains such as secondary structure propensity, charge, and polarity. Geometric features are descriptors of the molecular surface and protein structure that depend on its shape and topology. These features can encode distinct shapes that may be related to function such as binding pockets or helical DNA major and minor groove binding elements. Electrostatic features describe the local electrostatic environment of the solvated molecular surface based on the charge distribution and surface geometry. MSA features are based on position-specific scoring matrices (PSSMs) computed using PSI-BLAST (Altschul et al. 1997) and profile HMMs computed using HHblits (Remmert et al. 2012) (Methods). In total, 18 structure-based vertex features and four edge features are used in all our models as shown in Fig. 3b–d, and an additional 50 MSA features (20 PSSM + 30 HMM features) are used for binding site prediction. Some structural features depend on parameters that define a distance scale for that feature, and different values of these parameters are used to generate multi-scale features that range from high-variance features on the order of ~3 Å to low-variance features on the order of ~20 Å (Supplementary Fig. 1).

Prediction of DNA and RNA binding function

We developed three datasets for training models to predict DNA and RNA binding function using predicted AlphaFold2 structures available in the AlphaFold Protein Structure database (Varadi et al. 2021). Protein sequences in the UniProt database (UniProt 2023) with available AlphaFold2 structures were screened based on functional annotations and predicted structure quality and then clustered by sequence similarity to create DNA binding protein (DBP), RNA binding protein (RBP), and non-binding protein (nBP) datasets (details and statistics are available in Methods and Supplementary Fig. 4b). Three sets of models were trained: one set to distinguish DBP versus nBP (DnP-6784), a second set to distinguish RBP versus nBP (RnP-6046), and a third set to distinguish DBP versus RBP (RDP-6046). Eighteen structure-based features were used in these models (MSA features used for binding site prediction were not included). For each set of models, fivefold cross-validation was performed, with each holdout fold split into a 50/50 validation/test set; the validation set was used to determine early stopping. Average binary classification metrics on the test splits are shown in Table 1. Our models achieve high accuracy in distinguishing DBP and RBP from nBP with AUROC scores of 0.943 and 0.945, respectively. Our models also distinguish DBP from RBP with high accuracy, achieving an AUROC value of 0.920 despite these proteins binding targets with highly similar chemical structures. In Supplementary Table 2, we also explore the potential for distinguishing double-stranded DNA binding proteins (dsDBP) from single-stranded DNA binding proteins (ssDBP). Our results indicate high discriminative power, achieving performance metrics very similar to those in Table 1.
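
The cross-validation scheme described above, in which each holdout fold is halved into a validation and a test set, can be sketched as follows. This is an illustration only: the function name and seed are hypothetical, and the random assignment shown here ignores the sequence-similarity clusters that the datasets were built from.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train, validation, test) index lists, one triple per fold."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into n_folds interleaved folds.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k, holdout in enumerate(folds):
        half = len(holdout) // 2
        # First half of the holdout fold drives early stopping,
        # second half is reserved for reporting metrics.
        validation, test = holdout[:half], holdout[half:]
        train = [i for m, fold in enumerate(folds) if m != k for i in fold]
        yield train, validation, test

splits = list(five_fold_splits(100))
for train, validation, test in splits:
    print(len(train), len(validation), len(test))
```

Every item appears in exactly one holdout fold, so averaging test metrics over the five folds uses each protein at most once for evaluation.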

Table 1 Binary classification metrics of our models for predicting DNA binding, RNA binding, and DNA versus RNA binding function. Abbreviations stand for the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), balanced accuracy (BA), and Matthews correlation coefficient (MCC)

Mechanistic interpretation of DNA versus RNA binding predictions

Motivated by the high accuracy of our models in distinguishing DNA binding from RNA binding, we sought to better understand how our models learn to separate these two classes of proteins. We present two methods of interpreting our model predictions, based on feature and spatial attribution, and relate these to underlying physical mechanisms that determine DNA and RNA binding.

Feature attribution

Feature attribution quantifies how much a particular feature or group of features determines the predictive capability of a model. In our approach, we use feature permutation (Fisher et al. 2019) whereby a group of features is permuted (along the vertex dimension of the surface mesh) such that the spatial distribution of the features over the protein surface is scrambled. A visual example of the permutation process is shown in Fig. 4a. An error measure is computed for the permuted and unpermuted input, and the error introduced by permuting a group of features is called the feature importance, with more important features producing larger errors upon permutation. Figure 4b shows the importance of feature groups for models trained on the three datasets described above, with AUROC used to measure the decrease in performance. In all cases, a drop in performance is observed, indicating that all feature groups contribute some information related to binding function. On the task of predicting DBP versus nBP, hydrogen bond donors/acceptors (HB), Atchley factors (AF), and electrostatic potential features (EP) stand out as most important. For RBP versus nBP, AF and EP show high importance, with much less importance assigned to HB. In distinguishing DBP versus RBP, AF, HB, and EP features are all assigned high importance, but more strikingly geometric features (including edge features (EF)) show much higher relative importance, with the geometric shape descriptor HKS being assigned more importance than EP. This suggests that while electrostatic and chemical signatures may be sufficient to distinguish DNA and RNA binding from non-binding, the geometry of the protein surface is crucial for distinguishing DNA from RNA binding.
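
A minimal sketch of the permutation procedure is given below. The names, the toy scoring function, and the use of an absolute error on a single mesh are assumptions made for illustration; the importance values reported in Fig. 4b are instead measured as a decrease in AUROC over a dataset.

```python
import random

def permute_feature_group(vertex_features, columns, rng):
    """Shuffle the given feature columns together across vertices (rows)."""
    order = list(range(len(vertex_features)))
    rng.shuffle(order)
    permuted = [row[:] for row in vertex_features]
    for new_row, source in zip(permuted, order):
        for c in columns:
            new_row[c] = vertex_features[source][c]
    return permuted

def feature_importance(model, vertex_features, target, columns, rng):
    """Error after permuting a feature group minus the error before."""
    error_before = abs(model(vertex_features) - target)
    permuted = permute_feature_group(vertex_features, columns, rng)
    return abs(model(permuted) - target) - error_before

def toy_model(vertex_features):
    """Hypothetical mesh-level score: mean co-occurrence of features 0 and 1."""
    return sum(row[0] * row[1] for row in vertex_features) / len(vertex_features)

# Four vertices with three features each; feature 2 is ignored by the toy model.
mesh = [[1.0, 1.0, 0.3], [1.0, 1.0, 0.7], [0.0, 0.0, 0.1], [0.0, 0.0, 0.9]]
rng = random.Random(0)
used = feature_importance(toy_model, mesh, target=0.5, columns=[0], rng=rng)
unused = feature_importance(toy_model, mesh, target=0.5, columns=[2], rng=rng)
print(used, unused)
```

Permutation preserves the overall distribution of each feature and only destroys where on the surface its values occur, so a feature the model never reads receives an importance of exactly zero.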

Spatial attribution

Additional insight can be gained by performing spatial attribution, which quantifies how strongly different regions of the protein molecular surface contribute to the predicted probability of a given binding class. Intuitively, regions of a protein structure that most affect model predictions should be related to the binding mechanism of the protein, and correlate with the NA binding site. We computed the spatial attribution via Grad-CAM (Selvaraju et al. 2017), an attribution method where gradients for a target class probability are computed with respect to the activations of a chosen convolutional layer (typically the final layer), and average-pooled along the feature map dimension to produce a localized score for each element of the activation. These scores quantify how strongly a region of the input affects the predicted probability of a target class. Regions of low attribution contribute little to the predicted probability of a given class, and regions of high attribution strongly determine the output probability. In Fig. 4c, we show a visual example of the attribution scores for the “DNA-binding” class from the DBP versus nBP model for the repressor CtsR (UniProt ID C3W947). The attribution scores for the same protein are shown in Supplementary Fig. 2a alongside the known DNA binding site (determined from a native co-crystal structure), showing that regions of high attribution are visually correlated with the DNA binding site. In Supplementary Fig. 2b, we show that, in general, regions of high spatial attribution for both NA binding classes across all models correspond to regions that lie predominantly within known NA binding sites. DBP and RBP in our AlphaFold2 datasets which had experimentally determined co-crystal structures available in the PDB were identified, and the binding sites of these proteins were labeled using the co-crystal. We then computed precision and recall curves based on the attribution scores (first normalized between zero and one). 
These curves show that while the high attribution regions have low recall, they attain good precision—meaning the high attribution regions reside predominantly within the observed binding sites. This also indicates only particular regions within the binding sites are necessary for binding prediction.
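
The attribution and its comparison against known binding sites can be sketched as follows. The scoring function follows the standard Grad-CAM recipe of Selvaraju et al. (2017), transferred to mesh vertices; whether the graph variant used for PNAbind matches it in every detail is an assumption, and the activations, gradients, and site labels are made-up toy values.

```python
def grad_cam_scores(activations, gradients):
    """Standard Grad-CAM on a mesh: one normalized score per vertex.

    activations, gradients: per-vertex lists with one value per channel.
    """
    n_vertices, n_channels = len(activations), len(activations[0])
    # Channel weights: gradient of the class score averaged over all vertices.
    weights = [sum(g[c] for g in gradients) / n_vertices for c in range(n_channels)]
    # Weighted sum of activations per vertex, clipped at zero (ReLU).
    raw = [max(0.0, sum(w * a for w, a in zip(weights, row))) for row in activations]
    low, high = min(raw), max(raw)
    span = (high - low) or 1.0
    return [(r - low) / span for r in raw]  # min-max normalized to [0, 1]

def precision_recall(scores, in_binding_site, threshold):
    """Precision and recall of vertices whose attribution is >= threshold."""
    selected = [s >= threshold for s in scores]
    tp = sum(1 for s, y in zip(selected, in_binding_site) if s and y)
    n_selected, n_site = sum(selected), sum(in_binding_site)
    precision = tp / n_selected if n_selected else 1.0
    recall = tp / n_site if n_site else 0.0
    return precision, recall

# Hypothetical activations and gradients for four vertices and two channels.
activations = [[2.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
gradients = [[0.5, -0.5], [0.5, -0.5], [0.5, -0.5], [0.5, -0.5]]
scores = grad_cam_scores(activations, gradients)
site = [True, True, True, False]
print(scores, precision_recall(scores, site, threshold=0.9))
```

Sweeping the threshold from one down to zero traces out a precision-recall curve; a strict threshold selects few vertices, which is how high precision can coexist with low recall.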

Next, we plotted the distribution of input features within high attribution regions for every test set protein, using attribution scores computed from each of the three classification models described above. The threshold for high attribution was arbitrarily set at the 75th percentile across all values for a given protein. Scores were computed for the class corresponding to the ground-truth label of each protein. High attribution feature distributions are shown in Fig. 4d for the DBP versus RBP model, and for all models and features in Supplementary Fig. 2c–d. Clear separation can be seen in the distributions of many features for each target class and model. We note that in almost every case, feature distributions with large separation between classes also correspond to feature groups which were assigned a high importance as shown in Fig. 4b. The high attribution feature distributions provide insight into how different features may be related to binding mechanisms. For the DBP versus RBP model, a clear separation is seen in the features SAP (hydrophobicity) and PH1 (electrostatic potential). A scatter plot of the covariation of these features in Fig. 4e shows that the high attribution regions of RBP surfaces tend to be less hydrophilic and more electrostatically positive than those of DBP. Lysine and arginine occur frequently in NA binding sites as both are positively charged and form favorable electrostatic interactions with the negatively charged phosphate groups in the NA backbone. However, the proportions of these side chains may differ, as we observe in Fig. 4f. The high attribution regions of RBP show a higher lysine content, while those of DBP show a higher arginine content. Lysine is less hydrophilic than arginine, which explains the observed difference in hydrophobicity. We note that this observation is consistent with a meta-analysis of DNA and RNA binding residues performed by Zhang et al. (2019), which indicated a marginal preference for lysine in RNA binding sites and a preference for arginine in DNA binding sites. The enhanced electrostatic potential seen for RBP may be necessary for stabilizing the more globular structure RNA tends to adopt in protein-RNA complexes (example in Fig. 5a), which would otherwise experience destabilizing electrostatic repulsion between phosphate groups of the RNA backbone.
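
Selecting the high attribution region of a single protein at the 75th percentile can be sketched as below. The function name and the numbers are hypothetical, and the choice of an inclusive percentile with a "greater than or equal" comparison is an assumption, since the text does not specify these details.

```python
import statistics

def high_attribution_values(scores, feature):
    """Feature values at vertices scoring at or above the 75th percentile."""
    # quantiles(n=4) returns the 25th, 50th and 75th percentiles in that order.
    threshold = statistics.quantiles(scores, n=4, method="inclusive")[2]
    return [f for s, f in zip(scores, feature) if s >= threshold]

# Toy per-vertex attribution scores and one per-vertex feature for one protein.
scores = [0.5, 0.125, 1.0, 0.0, 0.875, 0.25, 0.75, 0.375, 0.625]
feature = [10, 11, 12, 13, 14, 15, 16, 17, 18]
print(high_attribution_values(scores, feature))
```

Pooling the returned values over all proteins of one class gives the kind of per-class feature distribution that is compared between DBP and RBP.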

Differences in geometric features are notable, as shown in two geometric feature distributions in Fig. 4g. These features are both measures of curvature defined over different distance scales. The DBP distribution shows a shift towards more positive mean curvature, which indicates that high attribution is assigned to regions with more convexity compared to RBP. The DBP distribution also shows a shift towards lower values of CV2 (circular variance with a radius of 15.0 Å; Methods). A low value of this feature indicates a point on the surface that is protruding outward relative to other regions of the surface within a chosen radius. Taken together, these distributions are consistent with protrusions from the molecular surface on the order of the width of the DNA major and minor grooves. We hypothesize that the differences in geometric features of DBP versus RBP within high attribution regions are related to the fact that DNA predominantly adopts a B-form double helix, while RNA folds into much more globular structures or adopts an A-form double helix, and these differences in the tertiary structure of DNA and RNA are reflected in the geometry of the complementary protein molecular surface. A similar conclusion was reached by Shazman et al. (2011) who performed an analysis based on the surface curvature of DNA and RNA binding proteins.
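
The behavior of circular variance can be made concrete with a short sketch based on its usual definition: one minus the length of the mean unit vector pointing from a point to its neighbors within a radius. Which points are used as neighbors (atoms or mesh vertices) and the value returned for a point without neighbors are assumptions here; the Methods give the definition used in PNAbind.

```python
import math

def circular_variance(center, points, radius):
    """1 - |mean unit vector| from `center` to all points within `radius`."""
    total = [0.0, 0.0, 0.0]
    count = 0
    for p in points:
        d = math.dist(center, p)
        if 0.0 < d <= radius:
            for k in range(3):
                total[k] += (p[k] - center[k]) / d
            count += 1
    if count == 0:
        return 0.0  # convention chosen here: an isolated point is fully exposed
    return 1.0 - math.hypot(*total) / count

origin = (0.0, 0.0, 0.0)
# Protruding tip: every neighbor lies in the same direction.
tip_neighbors = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
# Buried point: neighbors surround it symmetrically.
buried_neighbors = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
                    (0.0, -1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
print(circular_variance(origin, tip_neighbors, 15.0),
      circular_variance(origin, buried_neighbors, 15.0))
```

The two extremes show why low values mark protrusions: when all neighbors lie on one side, the unit vectors add up rather than cancel.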

Binding site prediction

We trained several sets of models for residue-level binding site prediction, which were then validated on previously developed benchmark datasets. Three independent sets of models were trained using identical hyperparameters (Fig. 2; Methods) on DBP and RBP training sets: DNA-573 (Xia et al. 2021), RNA-495 (Xia et al. 2021), and RBP09 (Li and Liu 2022) (Supplementary Fig. 4a). Training was performed using fivefold cross-validation, and an ensemble consisting of the best model from each training fold was used for predictions. Trained models were validated on four test sets, two containing native DBP (DNA-181 (Yuan et al. 2022) and DNA-129 (Xia et al. 2021)) and two containing native RBP (RNA-117 (Xia et al. 2021) and RBP61 (Li and Liu 2022)) (Supplementary Fig. 4a). Each test set was designed previously to have minimal sequence overlap with the corresponding training set or to only include co-crystal structures determined more recently than any training set structures. Figure 5a shows a gallery of binding site predictions on examples drawn from the test sets, with corresponding metric values for those predictions. Our models accurately identify DNA and RNA binding sites for a wide variety of structural domains and can detect binding sites of proteins with specificity for single-stranded, double-stranded, and more complicated secondary structures. Supplementary Fig. 4c shows the diverse range of biological functions of proteins in the test sets.
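
One simple way to combine the five fold models into an ensemble is to average their per-residue probabilities, as sketched below. Averaging is an assumption, since the text states only that an ensemble of the best model from each fold was used, and the toy models are hypothetical stand-ins for trained networks.

```python
def ensemble_predict(models, residue_features):
    """Average per-residue binding probabilities over all fold models."""
    per_model = [model(residue_features) for model in models]
    n_models = len(per_model)
    # zip(*per_model) groups the predictions of all models for each residue.
    return [sum(column) / n_models for column in zip(*per_model)]

def make_toy_model(bias):
    """Hypothetical stand-in for a trained fold model."""
    return lambda features: [min(1.0, max(0.0, x + bias)) for x in features]

models = [make_toy_model(b) for b in (-0.2, -0.1, 0.0, 0.1, 0.2)]
probabilities = ensemble_predict(models, [0.5, 0.25])
print(probabilities)
```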

Comparison with recent methods

Validation metrics on several test sets for our method and other recently developed methods are given in Table 2. We use the same residue labeling criteria as the cited studies. Namely, DNA-129, RNA-117, and DNA-181 binding site labels were originally constructed using interaction annotations from the BioLiP database (Xia et al. 2021; Yuan et al. 2022), and RBP61 residue labels were constructed based on a nucleotide-residue distance threshold of 5.0 Å (Li and Liu 2022). PNAbind achieves the highest AUROC value (ranging from 0.916 to 0.953), and the highest Matthews correlation coefficient (MCC) (ranging from 0.409 to 0.735) for all four test sets, and is highly competitive across other metrics.
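
Distance-based residue labeling and the MCC can be sketched as follows. The 5.0 Å threshold comes from the RBP61 labeling criterion; applying it to the minimum distance over all atom pairs, the inclusive comparison, and the residue identifiers and coordinates are assumptions made for illustration.

```python
import math

def label_binding_residues(residue_atoms, nucleotide_atoms, cutoff=5.0):
    """Label a residue 1 if any of its atoms is within `cutoff` Å of an NA atom."""
    labels = {}
    for residue_id, atoms in residue_atoms.items():
        near = any(math.dist(a, b) <= cutoff
                   for a in atoms for b in nucleotide_atoms)
        labels[residue_id] = int(near)
    return labels

def matthews_corrcoef(y_true, y_pred):
    """MCC from binary labels; returns 0.0 when the denominator vanishes."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Hypothetical coordinates (Å) for three residues and two nucleotide atoms.
residue_atoms = {
    "A:10": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    "A:11": [(10.0, 0.0, 0.0)],
    "A:12": [(30.0, 0.0, 0.0)],
}
nucleotide_atoms = [(4.0, 0.0, 0.0), (14.5, 0.0, 0.0)]
labels = label_binding_residues(residue_atoms, nucleotide_atoms)
print(labels, matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

MCC is informative for this task because binding residues are a minority class, and the coefficient stays near zero for predictors that label everything as non-binding.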

Table 2 Validation metrics for PNAbind and other recent methods on four test sets used in both our study and the cited studies. Each test set was evaluated using a model trained on the corresponding training set used in the cited studies, namely DNA-573 for DNA-129 and DNA-181, RNA-495 for RNA-117, and RBP09 for RBP61. Metric values for other predictors were collected from author-reported values (Yuan et al. 2022; Xia et al. 2021; Li and Liu 2022, 2023; Roche et al. 2024). PNAbind performance is assessed for full biological assemblies and protein monomers, denoted as PNAbind^A and PNAbind^M, respectively. Abbreviations stand for the area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and Matthews correlation coefficient (MCC)

Predictions made using protein assemblies are more accurate

In Table 2, two rows are given for PNAbind. The first shows validation metrics for models evaluated on the native experimental biological assembly of the protein, and the second on the monomeric structure (i.e., a single protein chain, irrespective of the native binding mode of the protein). On the benchmark datasets, the full-assembly predictions achieve consistently higher validation metrics than the monomeric predictions. To assess whether the native assembly provides an inherent advantage over the monomeric structure for binding site prediction, we collated all surface residues common to both structural models. This ignores residues from the monomeric structure that participate in protein-protein interactions and become buried beneath the solvent-accessible surface in the assembly structure. Therefore, the collated residue sets provide a direct one-to-one comparison. The validation metrics on these collated datasets are shown in Fig. 5b. In agreement with the values in Table 2, the assembly structure consistently has better AUROC scores than the monomeric structure. This indicates that PNAbind can capture not only the local chemical, evolutionary, and geometrical properties of the protein but also the biological structural context in which a binding site might appear in relation to co-factors, dimerization, etc. This suggests that, where available, the full native assembly should be used for binding site predictions.
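
The one-to-one comparison on collated residues can be sketched as below, with AUROC computed as the fraction of positive-negative pairs that are ranked correctly. The residue identifiers, labels, and scores are toy values chosen for illustration and are not results from the benchmark.

```python
def auroc(labels, scores):
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def compare_on_common_residues(labels, assembly_scores, monomer_scores):
    """AUROC of each structural model restricted to residues scored in both."""
    common = sorted(set(assembly_scores) & set(monomer_scores))
    y = [labels[r] for r in common]
    return (auroc(y, [assembly_scores[r] for r in common]),
            auroc(y, [monomer_scores[r] for r in common]))

# Residue 5 is exposed only in the monomer, so it is dropped from the comparison.
labels = {1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
assembly_scores = {1: 0.9, 2: 0.8, 3: 0.2, 4: 0.1}
monomer_scores = {1: 0.9, 2: 0.3, 3: 0.4, 4: 0.1, 5: 0.7}
print(compare_on_common_residues(labels, assembly_scores, monomer_scores))
```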

Binding site predictions on AlphaFold predicted protein structures

Binding site datasets were evaluated using models trained and tested on native protein structures bound to their NA targets. However, the conformation of a protein may change upon binding, and here we address the question of how much conformational variation between the bound and unbound state of a protein affects the accuracy of our model predictions. Many NABP lack native structures in the unbound state, so we use computationally derived structures as a proxy for native unbound structures. We used AlphaFold2 predicted structures for proteins in the benchmark test sets for which UniProt sequence mappings were available. DNA and RNA binding residues in the unbound proteins were labeled via global sequence alignment with the bound native structure. The mapped binding site labels allowed us to estimate the agreement of predicted binding sites between bound and unbound structures. Our results are shown in Fig. 5c, where we plot the AUROC score for every pair of bound/unbound structures. Good agreement in AUROC is seen between native and AlphaFold2-derived structures, with an overall Pearson correlation of 0.76. The mean AUROC for native bound DBP (RBP) was 0.945 (0.881), and for AlphaFold2 predicted DBP (RBP) was 0.924 (0.880), showing close agreement in the overall predictive performance. We note that, in general, the sequence alignment between native and predicted structure is not perfect because the predicted structure corresponds to the wild-type sequence, while the sequence used in the experimental model may contain modifications to overcome solubility or crystallization difficulties. We expect that these sequence modifications contribute in part to the small discrepancies observed between native bound and predicted unbound structures. Overall, however, the high correlation and similar predictive performance demonstrate that our method is applicable to both bound and unbound protein structures.
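
Transferring binding site labels through a sequence alignment can be sketched as follows. The sketch assumes the global alignment has already been computed by an external tool and is supplied as two gapped strings; the function name, sequences, and labels are hypothetical.

```python
def transfer_labels(aligned_native, aligned_predicted, native_labels):
    """Map per-residue labels from a native sequence onto a predicted one.

    Both strings are rows of a precomputed global alignment ('-' marks a gap).
    Predicted residues aligned to a gap in the native row receive None.
    """
    if len(aligned_native) != len(aligned_predicted):
        raise ValueError("alignment rows must have equal length")
    transferred = []
    native_index = 0
    for a, b in zip(aligned_native, aligned_predicted):
        if b != "-":
            transferred.append(native_labels[native_index] if a != "-" else None)
        if a != "-":
            native_index += 1
    return transferred

# Native construct MKRLS (labels per residue) aligned to wild-type MKARL.
print(transfer_labels("MK-RLS", "MKARL-", [0, 1, 1, 0, 0]))
```

Residues that receive `None` have no counterpart in the experimental construct, which is one way the sequence modifications mentioned above can introduce small discrepancies between bound and unbound evaluations.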

Negative control experiment

The training and testing datasets used for binding site prediction sample a wide variety of proteins with diverse structures and functional roles related to NA binding (Supplementary Fig. 4c). However, because the non-binding site regions are sampled only from structures of NABP, it is not clear how well the full domain of the non-binding feature space (i.e., the full space of protein structures which have no NA binding function) is sampled, and how the models will behave on this domain. To measure the performance of our models on out-of-domain targets, we constructed a negative control dataset which consists of proteins with no known NA binding function (Methods). For these proteins, we consider any binding site prediction to be a false positive, and we assess the model performance on the negative control using the false positive rate (FPR).

The FPRs on the negative control dataset are compared with those on the benchmark test sets in Fig. 5d. In each case, the FPR of the negative control is significantly lower than that observed for test set NABP. This indicates that our models perform better than expected on proteins which are outside the training domain (in a functional sense). We hypothesize that this is largely because our models capture the physical mechanisms of PNA binding, providing a high degree of generalizability. Supplementary Fig. 3 shows the FPR of each model for each residue type on the negative control and test sets. Variation is seen across the models and datasets, but in general, positively charged side chains (arginine, lysine, protonated histidine) and aromatic side chains (tyrosine, tryptophan, phenylalanine) have the highest FPR.
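
The false positive rate used for the negative control can be sketched as follows. The probability threshold of 0.5 and the example values are assumptions for illustration; on the negative control every residue is a true negative, so the FPR reduces to the fraction of residues predicted as binding.

```python
def false_positive_rate(labels, probabilities, threshold=0.5):
    """FP / (FP + TN) for per-residue predictions at a probability threshold."""
    false_positives = sum(1 for y, p in zip(labels, probabilities)
                          if y == 0 and p >= threshold)
    negatives = sum(1 for y in labels if y == 0)
    return false_positives / negatives if negatives else 0.0

# Negative control protein: no residue binds NA, so every label is 0.
control_fpr = false_positive_rate([0, 0, 0, 0], [0.1, 0.6, 0.2, 0.3])
# Test set NABP: only the non-binding residues enter the denominator.
test_fpr = false_positive_rate([1, 0, 0, 1], [0.9, 0.7, 0.2, 0.4])
print(control_fpr, test_fpr)
```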

Feature ablation study

We performed a feature ablation study to measure the contribution of different groups of features to the accuracy of our models. Figure 5e shows fivefold cross-validated AUROC scores for two of the benchmark training datasets. The labels and counts of the feature groups are PSSM features (P, 20), HMM profile features (H, 30), chemical features (C, 8), geometrical features (G, 8), and electrostatic features (E, 2). The highest AUROC scores are achieved when all groups of features are included. Among the MSA features (P/H), the profile HMMs appear to be more informative for both DNA and RNA binding sites, but a marginal performance improvement is still achieved by including PSSM features.

Among the three subsets of structural features, CE features appear more informative in discriminating DNA/RNA binding sites from non-binding sites, consistent with the findings in Fig. 4b for models trained to predict binding function.
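
Selecting feature groups for an ablation run can be sketched as below. The group labels and counts are taken from the text, whereas the column ordering and function names are assumptions made for illustration.

```python
# Group labels and sizes as listed for the ablation study (68 features in total).
FEATURE_GROUPS = {"P": 20, "H": 30, "C": 8, "G": 8, "E": 2}

def group_columns(groups=FEATURE_GROUPS):
    """Assign consecutive column indices to each group (ordering is assumed)."""
    columns, start = {}, 0
    for name, count in groups.items():
        columns[name] = list(range(start, start + count))
        start += count
    return columns

def select_features(vertex_features, keep):
    """Keep only the columns of the groups named in `keep`, e.g. "CE"."""
    columns = group_columns()
    selected = [c for name in keep for c in columns[name]]
    return [[row[c] for c in selected] for row in vertex_features]

# One vertex whose feature values equal their column index, for easy inspection.
vertex = [[float(i) for i in range(sum(FEATURE_GROUPS.values()))]]
print(select_features(vertex, "CE"))
```

Training one model per subset, such as "CE", "CGE", or "PHCGE", and comparing cross-validated AUROC then gives the kind of comparison shown in Fig. 5e.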

Case study—APOBEC3G dimerization and RNA binding

APOBEC3G (A3G) is a cytidine deaminase that catalyzes the conversion of cytosine to uracil on DNA and RNA substrates via a conserved zinc-coordinating motif that forms a catalytically active binding pocket. It is a member of the APOBEC3 (A3) family of proteins which play important roles in mammalian innate immune response against retroelements and retroviruses including HIV-1 (Xiao et al. 2016; Yang et al. 2020; Aydin et al. 2014). A3G binds RNA with strong affinity in multiple binding modes (Yang et al. 2020) and contains two domains known as CD1 and CD2. Yang et al. (2020) published the first structures of full-length A3G from the rhesus macaque, Macaca mulatta, in its homodimer configuration that is mediated through CD1-CD1 interactions (Fig. 6a). Although the protein was co-purified with bound RNA, they were unable to construct a structural RNA model based on the observed electron density. Using these recently determined A3G structures, we applied our models (trained on the RNA-495 dataset) to predict RNA binding in both the monomer and dimer conformation of A3G to determine if our models could help elucidate the role RNA binding plays in the dimerization of A3G.

Figure 6c shows RNA binding site predictions on full-length A3G (CD1 + CD2) in the dimer and monomer configuration. Yang et al. solved three structures of the A3G homodimer at different pH (PDB IDs: 6P3X, 6P3Y, 6P3Z). The predictions shown are averaged over residue labels predicted from all three structures. Our models predict a large RNA binding region (marked region 1) that spans the CD1-CD1 dimerization interface in the dimer structure, which is not present in the monomer structure. This is consistent with the experimental observation that A3G dimerization is RNA-dependent (Yang et al. 2020) and agrees with the region Yang et al. hypothesized may form the dimerized RNA binding site (Yang et al., Fig. 2). Yang et al. mutated critical residues in the dimerization interface of A3G which were shown to disrupt the dimerization and reported their experimental data (Fig. 6b, adapted from Yang et al. (2020)). Their data demonstrate a dramatic decrease in RNA association for the constructs with dimerization-disrupting mutations (rM9, rM10, rM15). Our models also predict a second, smaller region of RNA binding (marked region 2) that is independent of dimerization (i.e., is predicted for both monomer and dimer structures) and suggests that A3G should still possess some RNA binding capacity even if dimer formation is disrupted. This is consistent with the results reported by Yang et al. who measured binding affinities for 50-nt ssRNA for all mutants listed in Fig. 6b.

Yang et al. performed mutational studies to identify residues which are relevant for RNA binding. They determined several residues that play a key role in RNA binding which overlap with our model predictions. In region 1, these are R24, I26, S28, W127, N176, N177, and K180, and in region 2, K63 and R69. These residues are indicated in the insets for regions 1 and 2 in Fig. 6c. The results by Yang et al. provide strong experimental corroboration of our model predictions for APOBEC3G RNA binding. Overall, our results suggest that RNA binding to region 1 and dimerization of A3G may be mutually dependent on each other, and loss of dimerization may consequently result in loss of RNA binding to the resulting half-site of region 1.

Discussion

PNAbind predicts global protein binding function (e.g., DNA-binding versus RNA-binding) and predicts individual residues that together form NA binding sites using surface-based representations of proteins. Our choice of features is motivated by aspects of the protein molecular surface geometry and chemistry related to known NA recognition mechanisms. The geometry of the protein molecular surface is an important feature to consider and has been underutilized in previous work. Structural motifs found within the binding interface may have geometries related to binding function, such as binding pockets that can capture nucleotides in ssDNA or binding elements suitable for insertion into the major or minor groove of dsDNA. Our results on discriminating DNA versus RNA binding function highlight the importance of geometric information, as our models show geometric features to be at least as important if not more so than the chemical and electrostatic features used.

Experimentally determined structures are currently required for training models to predict NA binding sites with high accuracy, as there are no reliable alternatives for labeling the binding site locations in a training set. Functional annotations, however, do not require any structural data and are much more abundant. It is noteworthy that our models trained on predicted protein structures (many of which have no experimental structural models) not only achieve high accuracy in identifying NA binding and in discriminating DNA from RNA binding but can also partially identify the binding sites of these proteins, as shown in Supplementary Fig. 2a. This is achieved simply by inspecting which regions of the protein structure most contributed to the model prediction, using well-established attribution methods (Selvaraju et al. 2017). This approach allows the models to be applied to individual proteins for better understanding of how the protein structure relates to the binding function of the protein, or to perform a large-scale analysis and compare differences between classes of proteins as we have done in this study. A major strength of our approach is that it not only achieves high prediction accuracy but also allows for interpretation of model predictions in terms of structural features and corresponding binding mechanisms.

A significant majority of known protein sequences are poorly annotated. With over 200 million predicted protein structures now available in the AlphaFold database (Varadi et al. 2021), PNAbind can be used to aid in high-throughput annotation of NABP. Our binding site prediction models provide high accuracy on a more specialized task and can be used to aid in interpreting biochemical data, identify functionally important residues, or provide prior information about binding sites for modeling of PNA complexes. We also note that, while we have focused exclusively on NA binding in this study, PNAbind is a structure-based approach and hence is quite general and can be applied to the prediction of other types of functional annotations or other protein-ligand binding sites. Our method provides a general way to characterize how structural properties of proteins are related to their biological function, and in principle can be applied to any class of proteins.

Data availability

Data, source code, and documentation for PNAbind are available at https://doi.org/10.5281/zenodo.11288475 and https://github.com/jaredsagendorf/pnabind

References

Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, Bodenstein SW, Evans DA, Hung C-C, O’Neill M, Reiman D, Tunyasuvunakool K, Wu Z, Žemgulytė A, Arvaniti E, Beattie C, Bertolli O, Bridgland A, Cherepanov A, Congreve M, Cowen-Rivers AI, Cowie A, Figurnov M, Fuchs FB, Gladman H, Jain R, Khan YA, Low CMR, Perlin K, Potapenko A, Savy P, Singh S, Stecula A, Thillaisundaram A, Tong C, Yakneen S, Zhong ED, Zielinski M, Žídek A, Bapst V, Kohli P, Jaderberg M, Hassabis D, Jumper JM (2024) Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. https://doi.org/10.1038/s41586-024-07487-w
Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, O’Donnell TJ, Berenberg D, Fisk I, Zanichelli N, Zhang B, Nowaczynski A, Wang B, Stepniewska-Dziubinska MM, Zhang S, Ojewole A, Guney ME, Biderman S, Watkins AM, Ra S, Lorenzo PR, Nivon L, Weitzner B, Ban Y-EA, Chen S, Zhang M, Li C, Song SL, He Y, Sorger PK, Mostaque E, Zhang Z, Bonneau R, AlQuraishi M (2024) OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods. https://doi.org/10.1038/s41592-024-02272-z
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402. https://doi.org/10.1093/nar/25.17.3389
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25:25–29. https://doi.org/10.1038/75556
Aydin H, Taylor MW, Lee JE (2014) Structure-guided analysis of the human APOBEC3-HIV restrictome. Structure 22:668–684. https://doi.org/10.1016/j.str.2014.02.011
Baek M, McHugh R, Anishchenko I, Jiang H, Baker D, Dimaio F (2024) Accurate prediction of protein–nucleic acid complexes using RoseTTAFoldNA. Nat Methods 21:117–121. https://doi.org/10.1038/s41592-023-02086-5
Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, Millán C, Park H, Adams C, Glassman CR, DeGiovanni A, Pereira JH, Rodrigues AV, van Dijk AA, Ebrecht AC, Opperman DJ, Sagmeister T, Buhlheller C, Pavkov-Keller T, Rathinaswamy MK, Dalwadi U, Yip CK, Burke JE, Garcia KC, Grishin NV, Adams PD, Read RJ, Baker D (2021) Accurate prediction of protein structures and interactions using a three-track neural network. Science eabj8754. https://doi.org/10.1126/science.abj8754
Bateman A, Martin M-J, Orchard S, Magrane M, Agivetova R, Ahmad S, Alpi E, Bowler-Barnett EH, Britto R, Bursteinas B, A-Jee HB, Coetzee R, Cukura A, Da Silva A, Denny P, Dogan T, Ebenezer T, Fan J, Castro LG, Garmiri P, Georghiou G, Gonzales L, Hatton-Ellis E, Hussein A, Ignatchenko A, Insana G, Ishtiaq R, Jokinen P, Joshi V, Jyothi D, Lock A, Lopez R, Luciani A, Luo J, Lussi Y, MacDougall A, Madeira F, Mahmoudy M, Menchi M, Mishra A, Moulang K, Nightingale A, Oliveira CS, Pundir S, Qi G, Raj S, Rice D, Lopez MR, Saidi R, Sampson J, Sawford T, Speretta E, Turner E, Tyagi N, Vasudev P, Volynkin V, Warner K, Watkins X, Zaru R, Zellner H, Bridge A, Poux S, Redaschi N, Aimo L, Argoud-Puy G, Auchincloss A, Axelsen K, Bansal P, Baratin D, Blatter M-C, Bolleman J, Boutet E, Breuza L, Casals-Casas C, de Castro E, Echioukh KC, Coudert E, Cuche B, Doche M, Dornevil D, Estreicher A, Famiglietti ML, Feuermann M, Gasteiger E, Gehant S, Gerritsen V, Gos A, Gruaz-Gumowski N, Hinz U, Hulo C, Hyka-Nouspikel N, Jungo F, Keller G, Kerhornou A, Lara V, Le Mercier P, Lieberherr D, Lombardot T, Martin X, Masson P, Morgat A, Neto TB, Paesano S, Pedruzzi I, Pilbout S, Pourcel L, Pozzato M, Pruess M, Rivoire C, Sigrist C, Sonesson K, Stutz A, Sundaram S, Tognolli M, Verbregue L, Wu CH, Arighi CN, Arminski L, Chen C, Chen Y, Garavelli JS, Huang H, Laiho K, McGarvey P, Natale DA, Ross K, Vinayaka CR, Wang Q, Wang Y, Yeh L-S, Zhang J, Ruch P, Teodoro D (2021) UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res 49:D480–D489. https://doi.org/10.1093/nar/gkaa1100
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242. https://doi.org/10.1093/nar/28.1.235
Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol 10:980. https://doi.org/10.1038/nsb1203-980
Decherchi S, Rocchia W (2013) A general and robust ray-casting-based algorithm for triangulating surfaces at the nanoscale. PLoS ONE 8:e59744. https://doi.org/10.1371/journal.pone.0059744
Deng H, Birdal T, Ilic S (2018) PPFNet: global context aware local features for robust 3D point matching. In: 2018 IEEE/CVF Conf. Comp. Vision Pattern Recogn 195–205. https://doi.org/10.1109/CVPR.2018.00028
Dolinsky TJ, Nielsen JE, McCammon JA, Baker NA (2004) PDB2PQR: an automated pipeline for the setup of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Res 32:W665–W667. https://doi.org/10.1093/nar/gkh381
Draper DE (1999) Themes in RNA-protein recognition. J Mol Biol 293:255–270. https://doi.org/10.1006/jmbi.1999.2991
Edelsbrunner H (1999) Deformable smooth surface design. Discrete Comput Geom 21:87–115. https://doi.org/10.1007/PL00009412
Fisher A, Rudin C, Dominici F (2019) All models are wrong, but many are useful: learning a variable’s importance by studying an entire class of prediction models simultaneously. J Mach Learn Res 20:1–81
Gainza P, Sverrisson F, Monti F, Rodolà E, Boscaini D, Bronstein MM, Correia BE (2020) Deciphering interaction fingerprints from protein molecular surfaces using geometric deep learning. Nat Methods 17:184–192. https://doi.org/10.1038/s41592-019-0666-6
Gao H, Ji S (2019) Graph U-Nets. arXiv preprint. https://doi.org/10.48550/arXiv.1905.05178
Geng W, Krasny R (2013) A treecode-accelerated boundary integral Poisson-Boltzmann solver for electrostatics of solvated biomolecules. J Comput Phys 247:62–78. https://doi.org/10.1016/j.jcp.2013.03.056
Gligorijevic V, Renfrew PD, Kosciolek T, Leman JK, Berenberg D, Vatanen T, Chandler C, Taylor BC, Fisk IM, Vlamakis H, Xavier RJ, Knight R, Cho K, Bonneau R (2021) Structure-based protein function prediction using graph convolutional networks. Nat Commun 12. https://doi.org/10.1038/s41467-021-23303-9
Jones S, van Heyningen P, Berman HM, Thornton JM (1999) Protein-DNA interactions: a structural analysis. J Mol Biol 287:877–896. https://doi.org/10.1006/jmbi.1999.2659
Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Zidek A, Potapenko A, Bridgland A, Meyer C, Kohl SAA, Ballard AJ, Cowie A, Romera-Paredes B, Nikolov S, Jain R, Adler J, Back T, Petersen S, Reiman D, Clancy E, Zielinski M, Steinegger M, Pacholska M, Berghammer T, Bodenstein S, Silver D, Vinyals O, Senior AW, Kavukcuoglu K, Kohli P, Hassabis D (2021) Highly accurate protein structure prediction with AlphaFold. Nature 596:583–589. https://doi.org/10.1038/s41586-021-03819-2
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint. https://doi.org/10.48550/arXiv.1412.6980
Krapp LF, Abriata LA, Rodriguez FC, Dal Peraro M (2023) PeSTo: parameter-free geometric deep learning for accurate prediction of protein binding interfaces. Nat Commun 14:2175. https://doi.org/10.1038/s41467-023-37701-8
Lam JH, Li Y, Zhu L, Umarov R, Jiang H, Héliou A, Sheong FK, Liu T, Long Y, Li Y, Fang L, Altman RB, Chen W, Huang X, Gao X (2019) A deep learning framework to predict binding preference of RNA constituents on protein surface. Nat Commun 10:4941. https://doi.org/10.1038/s41467-019-12920-0
Li P, Liu Z-P (2023) GeoBind: segmentation of nucleic acid binding interface on protein surface with geometric deep learning. Nucleic Acids Res e60. https://doi.org/10.1093/nar/gkad288
Li P, Liu Z-P (2022) PST-PRNA: prediction of RNA-binding sites using protein surface topography and deep learning. Bioinformatics 38:2162–2168. https://doi.org/10.1093/bioinformatics/btac078
Mandel-Gutfreund Y, Schueler O, Margalit H (1995) Comprehensive analysis of hydrogen bonds in regulatory protein DNA-complexes: in search of common principles. J Mol Biol 253:370–382. https://doi.org/10.1006/jmbi.1995.0559
Miao Z, Westhof E (2015) A large-scale assessment of nucleic acids binding site prediction programs. PLOS Comput Biol 11:e1004639. https://doi.org/10.1371/journal.pcbi.1004639
Nadassy K, Wodak SJ, Janin J (1999) Structural features of protein-nucleic acid recognition sites. Biochemistry 38:1999–2017. https://doi.org/10.1021/bi982362d
Qi CR, Yi L, Su H, Guibas LJ (2017) PointNet++: deep hierarchical feature learning on point sets in a metric space. arXiv preprint
Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175. https://doi.org/10.1038/nmeth.1818
Roche R, Moussad B, Shuvo MH, Tarafder S, Bhattacharya D (2024) EquiPNAS: improved protein–nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks. Nucleic Acids Res 52:e27. https://doi.org/10.1093/nar/gkae039
Schmid EW, Walter JC (2024) Predictomes: a classifier-curated database of AlphaFold-modeled protein-protein interactions. bioRxiv preprint
Selvaraju RR, Cogswell M, Das A, Vedantam R, Parikh D, Batra D (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. Proc IEEE Internat Conf Comp Vision, pp 618–626
Shazman S, Elber G, Mandel-Gutfreund Y (2011) From face to interface recognition: a differential geometric approach to distinguish DNA from RNA binding surfaces. Nucleic Acids Res 39:7390–7399. https://doi.org/10.1093/nar/gkr395
Si J, Zhao R, Wu R (2015) An overview of the prediction of protein DNA-binding sites. Int J Mol Sci 16:5194–5215. https://doi.org/10.3390/ijms16035194
Sverrisson F, Feydy J, Correia BE, Bronstein MM (2021) Fast end-to-end learning on protein surfaces. In 2021 IEEE/CVF Conf Comp Vision Pattern Recogn, pp 15267–15276. https://doi.org/10.1109/CVPR46437.2021.01502
Tubiana J, Schneidman-Duhovny D, Wolfson HJ (2022) ScanNet: an interpretable geometric deep learning model for structure-based protein binding site prediction. Nat Methods 19:730–739. https://doi.org/10.1038/s41592-022-01490-7
UniProt Consortium (2023) UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res 51:D523–D531
Varadi M, Anyango S, Deshpande M, Nair S, Natassia C, Yordanova G, Yuan D, Stroe O, Wood G, Laydon A, Žídek A, Green T, Tunyasuvunakool K, Petersen S, Jumper J, Clancy E, Green R, Vora A, Lutfi M, Figurnov M, Cowie A, Hobbs N, Kohli P, Kleywegt G, Birney E, Hassabis D, Velankar S (2021) AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models. Nucleic Acids Res 50:D439–D444. https://doi.org/10.1093/nar/gkab1061
Walia RR, Caragea C, Lewis BA, Towfic F, Terribilini M, El-Manzalawy Y, Dobbs D, Honavar V (2012) Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art. BMC Bioinformatics 13:89. https://doi.org/10.1186/1471-2105-13-89
Wang J, Cieplak P, Kollman PA (2000) How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J Comput Chem 21:1049–1074. https://doi.org/10.1002/1096-987X(200009)21:12<1049::AID-JCC3>3.0.CO;2-F
Wang K, Hu G, Wu Z, Su H, Yang J, Kurgan L (2020) Comprehensive survey and comparative assessment of RNA-binding residue predictions with analysis by RNA type. Int J Mol Sci 21:6879. https://doi.org/10.3390/ijms21186879
Wu Z, Pan S, Chen F, Long G, Zhang C, Yu PS (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 1–21. https://doi.org/10.1109/TNNLS.2020.2978386
Xia Y, Xia C-Q, Pan X, Shen H-B (2021) GraphBind: protein-structural context embedded rules learned by hierarchical graph neural networks for recognizing nucleic-acid-binding residues. Nucleic Acids Res 49:e51. https://doi.org/10.1093/nar/gkab044
Xiao X, Li S-X, Yang H, Chen XS (2016) Crystal structures of APOBEC3G N-domain alone and its complex with DNA. Nat Commun 7:12193. https://doi.org/10.1038/ncomms12193
Xie T, Grossman JC (2018) Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys Rev Lett 120:145301. https://doi.org/10.1103/PhysRevLett.120.145301
Yan J, Friedrich S, Kurgan L (2015) A comprehensive comparative review of sequence-based predictors of DNA- and RNA-binding residues. Brief Bioinform 17:88–105. https://doi.org/10.1093/bib/bbv023
Yang H, Ito F, Wolfe AD, Li S, Mohammadzadeh N, Love RP, Yan M, Zirkle B, Gaba A, Chelico L, Chen XS (2020) Understanding the structural basis of HIV-1 restriction by the full length double-domain APOBEC3G. Nat Commun 11:632. https://doi.org/10.1038/s41467-020-14377-y
Yuan Q, Chen S, Rao J, Zheng S, Zhao H, Yang Y (2022) AlphaFold2-aware protein–DNA binding site prediction using graph transformer. Brief Bioinform 23:bbab564. https://doi.org/10.1093/bib/bbab564
Zhang J, Ma Z, Kurgan L (2019) Comprehensive review and empirical analysis of hallmarks of DNA-, RNA- and protein-binding residues in protein chains. Brief Bioinform 20:1250–1268. https://doi.org/10.1093/bib/bbx168
Zhou J, Cui G, Hu S, Zhang Z, Yang C, Liu Z, Wang L, Li C, Sun M (2018) Graph neural networks: a review of methods and applications. arXiv preprint. https://doi.org/10.48550/arXiv.1812.08434

Funding

Open access funding provided by SCELC, Statewide California Electronic Library Consortium. This work was supported by an Andrew J. Viterbi Fellowship in Computational Biology and Bioinformatics (to R.M.), the National Institutes of Health (grant R01AI150524 to X.S.C.; grant R35GM130376 to R.R.), and the Human Frontier Science Program (grant RGP0021/2018 to R.R.).

Author information

Jared M. Sagendorf
Present address: Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA

Authors and Affiliations

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, 90089, USA
Jared M. Sagendorf, Raktim Mitra, Jiawei Huang & Remo Rohs
Molecular and Computational Biology Section, Department of Biological Sciences, University of Southern California, Los Angeles, CA, 90089, USA
Xiaojiang S. Chen
Department of Chemistry, University of Southern California, Los Angeles, CA, 90089, USA
Xiaojiang S. Chen & Remo Rohs
Department of Physics and Astronomy, University of Southern California, Los Angeles, CA, 90089, USA
Remo Rohs
Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA, 90089, USA
Remo Rohs

Authors

Jared M. Sagendorf
Raktim Mitra
Jiawei Huang
Xiaojiang S. Chen
Remo Rohs

Contributions

J.M.S. designed the PNAbind method conceptually; designed the GNN architecture and features; developed code for implementing, training, and evaluating models; and analyzed data. J.M.S. wrote the manuscript with help from R.R. and comments from all authors. R.M. developed code, assisted with figure preparation, implemented and tested model components, and processed data. J.H. assisted in data processing and hyperparameter tuning. X.S.C. provided data for experimental validation using the A3G protein. R.R. supervised the project.

Corresponding author

Correspondence to Remo Rohs.

Ethics declarations

Ethical approval

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 1.25 MB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Sagendorf, J.M., Mitra, R., Huang, J. et al. Structure-based prediction of protein-nucleic acid binding using graph neural networks. Biophys Rev (2024). https://doi.org/10.1007/s12551-024-01201-w

Received: 19 April 2024
Accepted: 28 May 2024
Published: 26 June 2024
DOI: https://doi.org/10.1007/s12551-024-01201-w
