Introduction

Nuclear receptors (NRs) are a family of transcription factors that play a crucial role in regulating various biological processes, including cell growth, development, and metabolism [1, 2]. Their biochemical significance has prompted a great deal of research in the fields of toxicology and medicinal chemistry, with many drug discovery projects using machine learning (ML) to select compounds for the development of NR-based drugs. Nonetheless, discovering novel NR-modulators with high affinity and specificity is difficult due to structural similarities and shared domains among multiple NRs [3,4,5].

Ongoing research in computational toxicology is focused on developing in silico methods to model the activity of compounds against a group of NRs or their selectivity among specific NRs. Several ML models of chemical activity against multiple NRs have begun to emerge to predict NR-modulators with the potential to target various diseases [6]. However, there is limited research on the impact of using distinct types of NRs to design quantitative structure-activity relationship (QSAR) models for a given target receptor. In addition, current computational approaches are centered around a single receptor, and there has been no attempt to transfer the learned knowledge across multiple NRs [7].

One potential approach for NR-modulation with QSAR is multi-task learning. In the case of NRs, multi-task learning can be applied to train a model across multiple related tasks and evaluate the NR-activity of different ligands, or to infer the effects of NR-ligands in different tissues [8, 9]. In drug discovery, where high-quality labeled information is limited, meta-learning is particularly useful as it allows models to learn across few-shot tasks for different molecular properties and improve generalization with few labeled compounds [10,11,12,13,14]. Hence, multi-task meta-learning can improve the accuracy of QSAR models in predicting the activity of compounds on specific biological targets with limited data [15,16,17,18,19].

Compounds can be represented as molecular graphs, with nodes representing the atoms and edges describing the chemical bonds [20]. Graph neural networks (GNNs) update node-edge embeddings in graph-structured data via neighborhood aggregation to output a graph-level embedding useful for molecular property discovery [21, 22]. However, standard GNNs only aggregate local dependencies and cannot capture the broader node-edge connections significant for compound classification. Transformers address this issue by learning long-range dependencies while maintaining the global structure of molecular embeddings. These models attend to multiple positions to preserve global-semantic information in molecule embeddings and generalize across different molecular properties [23, 24]. Vision Transformers (ViT) extend standard Transformer attention to propagate sequences of visual tokens and achieve improved performance on image classification tasks [25]. Recent advancements in ViT approaches have produced multiple hybrid architectures that combine them with different neural network models [26]. Nevertheless, the potential of ViT networks remains largely unexplored in molecule representation learning for inferring the NR-binding activity of chemical compounds in NR-based drug discovery.

To address this challenge, a novel few-shot GNN-Transformer, Meta-GTNRP, is introduced for NR-binding activity prediction using limited labeled compounds. The proposed approach treats compounds as graph-structured data encoding the local-to-global context of molecular structures for NR-binding activity prediction. In addition, a meta-learning approach is proposed to optimize model parameters across multiple few-shot tasks and predict their specific NR-binding properties with limited data. In this research, we use the NUclear Receptor Activity (NURA) database [27], which describes the experimentally-derived binding, agonist and antagonist activities against various human NRs. Multiple experiments with NR-activity data demonstrate that Meta-GTNRP achieves improved performance on NR-activity tasks over conventional graph-based approaches.

Related work

Few-shot learning for NR-binding activity prediction

Few-shot learning (FSL) is a meta-learning approach that focuses on generalizing from reduced amounts of supervised information. FSL has found recent applications in compound discovery by predicting clinically-relevant properties using limited high-quality data. Here, the goal of FSL is to adapt model parameters for different molecular tasks (meta-training) and use them to predict important molecular property tasks using small sets of labeled compounds (meta-testing) [28, 29]. FSL methods can be classified into two categories. Metric-based models [30] learn a distance metric that captures the relationship between task-specific support sets and separate query sets, enabling effective transfer of knowledge across different few-shot tasks. Optimization-based methods [31], on the other hand, adapt model parameters within different tasks represented by task-specific support sets and generalize to novel representations in separate query sets using a few gradient steps. In this research, we introduce an optimization-based meta-learning approach to learn across different NR-tasks and generalize to new NR-binding meta-testing tasks. In meta-training, few-shot models are trained on NR-specific support sets to adapt model parameters for different NR-specific tasks by computing gradients and losses on disjoint query sets of molecules. In meta-testing, these parameters are used to infer the NR-binding properties of compounds for new NR tasks using limited data.

Graph representation learning

Representing molecules as graph-structured data can more accurately depict the relationships among atoms important for predicting NR-binding properties [32]. GNNs encode molecules as molecular graphs, using a set of nodes to represent the atoms and a set of edges to describe the chemical bonds between them. Through a message-passing approach, GNNs aggregate node-edge information to compute molecule graph embeddings, capturing the molecule's overall structure in a multi-dimensional graph embedding space. More specifically, graph convolutional networks (GCNs) [33] incorporate a convolutional operation that aggregates local information and updates node-edge embeddings, analogous to the filters used in convolutional layers. An alternative technique, GraphSAGE [34], utilizes a node-centric inductive training approach to learn node embeddings in large molecular graph structures for unseen graph features. In addition, graph isomorphism networks (GIN) [35] are powerful GNN frameworks that extend the Weisfeiler-Lehman (WL) isomorphism test, demonstrating impressive results on different downstream applications. In a pioneering study, Hu et al. [36] pre-trained GNNs to learn local information and obtained improved performance across various chemical property tasks. Building on this method, Guo et al. [37] proposed a novel meta-learning approach that allows GNNs to quickly adapt across tasks using task-specific weights to meet self-supervised objectives in molecular property discovery.

Transformer networks

Transformer networks, introduced by Vaswani et al. [38], are natural language processing (NLP) models that leverage self-attention to learn from sequential data while retaining its global structure. The attention mechanism is complemented by feed-forward (FF) layers, making it a commonly used method for various NLP tasks. Vision Transformers (ViT) introduce a new application of Transformers that generalizes to image classification tasks. Dosovitskiy et al. [39] developed this approach, which outperforms conventional convolutional networks by treating inputs as sequences of non-overlapping image tokens known as patches. ViT blocks include multi-head self-attention layers and FF networks, which model long-range dependencies among patches for computer vision tasks [40, 41]. In molecule discovery, the application of ViT networks has not been extensively studied. In this research, we develop a few-shot graph-based ViT architecture which combines the local context of molecule graphs with the global-semantic information captured by attention operations to effectively predict NR-binding properties using reduced amounts of labeled compounds.

Nuclear receptor data

In this work, data is collected from a public compound repository known as the NURA (NUclear Receptor Activity) database [27], which includes public information on the activity of 15,247 compounds against 11 human NRs. The database contains information on compounds collected from sources such as ChEMBL25 [42], BindingDB [43], NR-DBIND [44] and Tox21 [45], with chemical structures expressed as SMILES (Simplified Molecular Input Line Entry System) strings [46]. In this study, molecules are represented as molecular graphs obtained from SMILES using the RDKit.Chem library [47]; during pre-processing, SMILES are canonicalized and duplicates are removed. This curated dataset covers the in vitro bioactivity of compounds against 11 nuclear receptors (NRs), selected based on their biological significance and availability in public databases: androgen receptor (AR), estrogen receptor \(\alpha\) (ERA), estrogen receptor \(\beta\) (ERB), progesterone receptor (PR), glucocorticoid receptor (GR), peroxisome proliferator-activated receptor \(\alpha\) (PPARA), peroxisome proliferator-activated receptor \(\gamma\) (PPARG), peroxisome proliferator-activated receptor \(\delta\) (PPARD), pregnane X receptor (PXR), retinoid X receptor (RXR) and farnesoid X receptor (FXR). The dataset thus comprises three types of bioactivity measurements against the 11 NRs: binding activity, agonist activity and antagonist activity. In each case, compounds are assigned an activity label according to their experimental bioactivities against specific NRs: (1) "active", if bioactivity is equal to or lower than 10,000 nM (positive); (2) "weakly active", for bioactivity between 10,000 and 100,000 nM (positive); (3) "inactive", for bioactivity values greater than 100,000 nM (negative); (4) "inconclusive", for compounds having conflicting labels for all 3 cases; (5) "missing", for compounds having missing information for at least one case. In our experiments, we merge "active" and "weakly active" into a positive label and map "inactive" to a negative label for the binding (BIN), agonist (AGO) and antagonist (ANT) activity classification tasks; compounds with "inconclusive" and "missing" labels are excluded. Table 1 below reports the distribution of compounds across all 11 NRs for the binding (BIN), agonist (AGO) and antagonist (ANT) activity classification tasks.
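As a rough illustration of this curation step, the following Python sketch shows how the SMILES canonicalization, de-duplication and label merging could be performed with RDKit and pandas; the column names ("smiles" and the NR label column) are hypothetical placeholders, not the actual NURA schema.

```python
# Hypothetical curation sketch; "smiles" and label column names are
# placeholders, not the NURA schema.
import pandas as pd
from rdkit import Chem

def canonicalize(smiles):
    # Return canonical SMILES, or None if RDKit cannot parse the string.
    mol = Chem.MolFromSmiles(smiles)
    return Chem.MolToSmiles(mol) if mol is not None else None

def curate(df, label_col):
    df = df.assign(smiles=df["smiles"].map(canonicalize))
    df = df.dropna(subset=["smiles"]).drop_duplicates(subset=["smiles"])
    # Merge "active"/"weakly active" into the positive class and keep
    # "inactive" as negative; "inconclusive"/"missing" rows are dropped.
    mapping = {"active": 1, "weakly active": 1, "inactive": 0}
    df = df[df[label_col].isin(mapping)]
    return df.assign(**{label_col: df[label_col].map(mapping)})
```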

Table 1 Distribution of positive and negative samples for binding, agonist and antagonist activity labels for all 11 nuclear receptors

Methods

Graph neural network module (GNN)

Molecular graphs are graph-structured representations of atoms and their connections via chemical bonds within a molecule. Molecular graphs are denoted by \(G = (V,E)\), with \(V\) the set of nodes \(v\) (atoms) and \(E\) the set of edges \(e\) (chemical bonds). Edges are defined by \(e = (v,u)\), where \(v\) and \(u\) are nodes interconnected in a neighborhood \(N(v)\). Graph neural networks (GNNs) use a neighborhood aggregation function to update node embeddings \(h_v\) and build graph embedding representations \(h_G\) used in molecule classification. In this research, a GIN with \(L_{GIN} = 5\) layers is proposed as an embedding network to detect the local dependencies in molecular graphs \(G\) and compute graph embeddings \(h_G\). The GIN performs AGGREGATE and COMBINE steps as a sum of node and edge features. Node embeddings \(h_v\) are updated for each message-passing iteration \(l\) by

$$\begin{aligned} & m_{N(v)}^{l} = AGGREGATE^{l}( \{h_u^{l-1},\forall u \in N(v)\}, \{ h_e^{l-1}: e = (v,u) \}) \end{aligned}$$
(1)
$$\begin{aligned} & h_{v}^{l} = \sigma (MLP^{l} (COMBINE^{l} (h_{v}^{l-1}, m_{N(v)}^{l})))\end{aligned}$$
(2)

with \(m_{N(v)}\) the "message" propagated through the GNN layers, \(h_u^l\) the embeddings of neighboring nodes, and \(h_e^l\) the embedding of the edge between nodes \(u\) and \(v\). After node-edge aggregation, multiple message-passing iterations \(l\) update node embeddings \(h_v^l\) using prior representations of that node \(h_v^{l-1}\) and the embeddings of its neighboring nodes \(h_u^{l-1}\) with \(u \in N(v)\). The UPDATE step applies a multi-layer perceptron \(MLP\) to introduce non-linearity, followed by the activation \(\sigma = ReLU\)

$$\begin{aligned} h_{v}^{l} = ReLU(MLP^{l}(\sum _{u\in N(v) \cup {v}} h_{u}^{l-1} + \sum _{e = (v,u): u\in N(v) \cup {v}} h_{e}^{l-1})). \end{aligned}$$
(3)

At the final layer \(L_{GIN} = 5\), a READOUT step pools node embeddings to produce a graph-level embedding \(h_{G}\). This graph embedding is obtained by averaging node representations \(h_v\) using a mean-pooling operation, \(h_{G} = mean(\{h_{v}^{L_{GIN}}: v \in V\})\). Input node-edge embedding features \((h_v^0, h_e^0)\) are described by multiple atom-bond attributes, including atom type (AT) and atom chirality (AC) with \(h_v^0 = \{v_{AT}, v_{AC}\}\), and bond type (BT) and bond direction (BD) with \(h_e^0 = \{e_{BT}, e_{BD}\}\). The pre-trained GNNs of Hu et al. [36] are leveraged to initialize the GIN model. In this setting, we consider 5 GIN layers and an embedding size of 300. A schematic of the GNN-Transformer architecture is presented in Fig. 1.
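The following minimal plain-PyTorch sketch illustrates the GIN update of Eq. (3) and the mean READOUT on a dense adjacency matrix; it is a toy version for clarity, not the paper's implementation, which builds on the pre-trained GINs of Hu et al. [36].

```python
# Toy dense-adjacency version of Eq. (3) and the mean READOUT; not the
# actual Meta-GTNRP GIN module.
import torch
import torch.nn as nn

class ToyGINLayer(nn.Module):
    def __init__(self, dim=300):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                 nn.Linear(2 * dim, dim))

    def forward(self, h_v, h_e, adj):
        # h_v: [V, dim] node features; h_e: [V, V, dim] edge features;
        # adj: [V, V] 0/1 adjacency with self-loops, i.e. N(v) ∪ {v}.
        msg = adj @ h_v                                   # sum over neighbor nodes
        msg = msg + (adj.unsqueeze(-1) * h_e).sum(dim=1)  # sum over incident edges
        return torch.relu(self.mlp(msg))                  # MLP + ReLU update

def readout(h_v):
    # READOUT: h_G = mean({h_v : v in V}) after the final GIN layer.
    return h_v.mean(dim=0)

# Toy usage: a 4-atom chain molecule with random features.
V, dim = 4, 300
adj = torch.eye(V) + torch.tensor([[0, 1, 0, 0], [1, 0, 1, 0],
                                   [0, 1, 0, 1], [0, 0, 1, 0]], dtype=torch.float)
h_v, h_e = torch.randn(V, dim), torch.randn(V, V, dim)
h_G = readout(ToyGINLayer(dim)(h_v, h_e, adj))
```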

Fig. 1 Graphical depiction of the two-module GNN-Transformer architecture, Meta-GTNRP

Transformer prediction module (TR)

In our research, we investigate how to combine Transformers and GNNs to better discriminate the global-semantic context and long-range dependencies within molecule graph embeddings \(h_G\) for NR-binding activity prediction. A Transformer network with \(L_T = 5\) blocks is introduced to convert graph embeddings \(h_G\) into token embeddings \(h_T\). This prediction module operates as a vision Transformer (ViT) [35, 48] that accepts the graph embedding \(h_G\) transformed into a sequence of patches in a space of dimension \(D = N \times P^2\), where \(N\) is the number of patch tokens and \(P\) the size of each patch token. The Transformer accepts embeddings \(x\) converted into sequences of patches \(x_p\)

$$\begin{aligned} T(x) = [x_p^1, x_p^2,\ldots ,x_p^N]\end{aligned}$$
(4)

where \(x_p^i\) denotes the \(i\)-th patch vector. Specifically, the Transformer converts graph embeddings \(h_G\) into \(N = (\lfloor \frac{300}{P} \rfloor )^2\) patch tokens of size \(P\). Token embeddings \(h_T(x) = T(x) \cdot K\) are produced by a linear projection \(K \in \mathbb {R}^{P^2 \times D}\) of the patch vectors \(x_p^i\) into a Transformer dimension \(D\). The Transformer propagates token embeddings \(h_T\) through multi-head self-attention (MSA) layers. MSA takes three inputs: queries \(q\), keys \(k\) and values \(v\), stacked in matrices \((Q, K, V)\), and calculates a dot-product attention between queries \(q\) in \(Q\) and keys \(k\) in \(K\). MSA applies a softmax operation to obtain the attention weights for values \(v\) in \(V\). In addition, MSA layers include \(H\) projection heads, and attention values are calculated by

$$\begin{aligned} & MSA (Q,K,V) = CONCAT(head_1,\ldots , head_H) W \end{aligned}$$
(5)
$$\begin{aligned} & head_j = Attention(QW^Q_j,KW^K_j,VW^V_j) = softmax(\frac{QW^Q_j (KW^K_j)^T}{\sqrt{D}})VW^V_j \end{aligned}$$
(6)

where \((W^Q_j, W^K_j, W^V_j)\) are the projection matrices of \((Q,K,V)\) for each attention head \(j\). Transformer blocks use \(MSA\) layers followed by feed-forward networks (\(FFN\)). The \(FFN\) includes a point-wise (\(PW\)) convolutional operation to undersample \(Q\) and \(K\) and model the local context of token embeddings more efficiently. Then, a convolution operation is applied to the \(Q\), \(K\) and \(V\) matrices using a depth-wise (\(DW\)) separable convolution with kernel size \(s = 3\), followed by batch normalization and a \(PW\) convolution operation. The \(MSA\) and \(FFN\) layers are preceded by layer normalization \((LN)\) and wrapped in residual connections. The individual patch tokens \(x_p\) are propagated across multiple Transformer blocks \(l\) to obtain the token embedding representations \(h_T\) given by
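A literal NumPy transcription of Eqs. (5)-(6) may clarify the multi-head attention computation; here each head is scaled by the square root of its projected dimension, a per-head variant of the \(\sqrt{D}\) normalization above, and all projection matrices are random stand-ins.

```python
# Toy transcription of Eqs. (5)-(6); projections are random stand-ins.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))   # row-wise, stabilized
    return e / e.sum(axis=-1, keepdims=True)

def msa(Q, K, V, W_q, W_k, W_v, W):
    heads = []
    for W_qj, W_kj, W_vj in zip(W_q, W_k, W_v):      # one iteration per head (Eq. 5)
        q, k, v = Q @ W_qj, K @ W_kj, V @ W_vj
        A = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # attention weights (Eq. 6)
        heads.append(A @ v)
    return np.concatenate(heads, axis=-1) @ W        # CONCAT + output projection W

rng = np.random.default_rng(0)
N, D, H = 9, 64, 4                                   # tokens, model dim, heads
x = rng.normal(size=(N, D))
W_q = [rng.normal(size=(D, D // H)) for _ in range(H)]
W_k = [rng.normal(size=(D, D // H)) for _ in range(H)]
W_v = [rng.normal(size=(D, D // H)) for _ in range(H)]
W = rng.normal(size=(D, D))
out = msa(x, x, x, W_q, W_k, W_v, W)                 # self-attention: Q = K = V
```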

$$\begin{aligned} & h_{T}^0 = [x_p^1 K, x_p^2 K, x_p^3 K,\ldots , x_p^N K] + K_{pos} \end{aligned}$$
(7)
$$\begin{aligned} & h_T^{l^*} = MSA(LN(h_T^{l-1})) + h_T^{l-1} \end{aligned}$$
(8)
$$\begin{aligned} & h_T^l = FFN(LN(h_T^{l^*})) + h_T^{l^*} \end{aligned}$$
(9)
$$\begin{aligned} & y = LN(h_{T}^{L_T})\end{aligned}$$
(10)

where \(l = \{1,\ldots , L_T\}\), \(h_{T}^{l}\) are the deep token embedding representations, \(K\) is the linear projection of individual patch tokens and \(K_{pos}\) the positional embeddings, with \(K \in \mathbb {R}^{P^2 \times D}\) and \(K_{pos} \in \mathbb {R}^{(N+1) \times D}\), and \(y\) is the output vector. Then, an \(MLP\) followed by a sigmoid activation function is applied to the output \(cls\) token to predict a molecular label for the different NR-binding activity prediction tasks (with an output value \(\in \{0,1\}\)). The main hyper-parameters of the Transformer prediction module of Meta-GTNRP are displayed in Table 2.
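The wiring of Eqs. (7)-(10) can be sketched in PyTorch as follows; this schematic uses a standard MLP feed-forward block in place of the convolutional FFN described above (the depth-wise and point-wise convolutions and the \(cls\) token are omitted), with toy sizes, so it outlines the data flow rather than the actual Meta-GTNRP module.

```python
# Schematic of Eqs. (7)-(10) with a plain MLP FFN; toy sizes, cls token and
# convolutional FFN omitted.
import torch
import torch.nn as nn

class ToyViTBlock(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, h):
        x = self.ln1(h)
        a, _ = self.msa(x, x, x)          # Eq. (8): MSA(LN(h)) ...
        h = h + a                         # ... plus residual connection
        return h + self.ffn(self.ln2(h))  # Eq. (9): FFN(LN(h*)) + h*

N, P, D, L_T = 9, 10, 64, 5               # toy patch count and dimensions
patches = torch.randn(1, N, P * P)        # Eq. (4): sequence of patch tokens x_p
K = nn.Linear(P * P, D)                   # linear projection K
K_pos = torch.zeros(1, N, D)              # positional embeddings K_pos
h = K(patches) + K_pos                    # Eq. (7)
for block in [ToyViTBlock(D, heads=4) for _ in range(L_T)]:
    h = block(h)
y = nn.LayerNorm(D)(h)                    # Eq. (10)
```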

Table 2 Main hyper-parameters of the Transformer module

Few-shot meta-learning framework for NR-binding activity prediction

In this research, a few-shot meta-learning framework built upon two distinct neural network modules is introduced to learn complementary information across few-shot tasks for NR-binding activity prediction. This strategy leverages the relationship among different NRs by integrating the information of NR-specific predictive tasks in a joint learning procedure. The framework is composed of two main components: a GNN module and a Transformer (TR) module. Both meta-models update model parameters across few-shot tasks (meta-training) for 10 different NRs, using random support sets for training and query sets with the remaining samples for evaluation. These parameters are then leveraged to infer the binding activity of compounds against 1 new specific NR (meta-testing) [30, 31]. In this framework, molecules are organized across meta-training tasks to optimize the model parameters by evaluating the binding activity of compounds for 10 different NRs. The parameters obtained in meta-training are then leveraged to infer the binding activity of compounds for 1 new NR in a new meta-testing task. The main objective is to predict the binding activity of compounds for 1 specific NR using the NR-binding information of the other 10 NR-binding tasks, so that \(\{f_{\theta }(G), g_{\theta ^*}(h_G)\}:S\Rightarrow \{0,1\}\in Y\), where \(S\) is the chemical space of molecule graphs \(G\), \(h_G\) is the output graph embedding space of a GNN \(f_{\theta }\), \(g_{\theta ^*}\) is a Transformer (TR), and \(Y\) the labels for each individual NR. A GIN model \(f_{\theta }\) with parameters \(\theta\) and a TR model \(g_{\theta ^*}\) with parameters \(\theta ^*\) learn across different few-shot tasks \(t \in \{1,\ldots ,N_{NR-train}\}\) for each individual NR. For each meta-task, the meta-models \(f_{\theta }\) and \(g_{\theta ^*}\) are trained using random support sets \(S_{t}\) of molecule graphs \(G_{S_{t_i}}\) and evaluated using query sets \(Q_{t}\) of graphs \(G_{Q_{t_i}}\).

In meta-training, support sets of size \(k\) are randomly sampled as input for the GNN \(f_{\theta }\) and TR \(g_{\theta ^*}\) models to obtain the support losses \(\mathcal {L}^{GNN}_{t}\), \(\mathcal {L}^{TR}_{t}\) for each individual NR across meta-training tasks \(t \in \{1,\ldots ,N_{NR-train}\}\) with \(N_{NR-train} = 10\). Support losses are used to iteratively update the parameters \(\theta \rightarrow \theta '\), \(\theta ^{*} \rightarrow \theta ^{*'}\). Then, the GNN and TR models compute query losses \(\mathcal {L}^{GNN '}_{t}\), \(\mathcal {L}^{TR '}_{t}\) with the remaining \(n\) samples for a specific task. At this stage, the parameters \(\theta\), \(\theta ^{*}\) are updated using a few gradient steps

$$\begin{aligned} & \theta _{t} = \theta - \alpha \triangledown _\theta \mathcal {L}^{GNN}_{t}(\theta ) \end{aligned}$$
(11)
$$\begin{aligned} & \theta ^*_{t} = \theta ^* - \alpha ^* \triangledown _{\theta ^*} \mathcal {L}^{TR}_{t}(\theta ^*) \end{aligned}$$
(12)

where \(\alpha\) and \(\alpha ^*\) are the step sizes for the gradient descent updates. In meta-testing, a support set with \(k\) random samples is obtained for 1 new NR-specific test task \(t = N_{NR-train}+N_{NR-test}\) with \(N_{NR-test} = 1\), and the parameters \(\theta\), \(\theta ^{*}\) are initialized with the model parameters obtained in meta-training, \(\theta \rightarrow \theta '\), \(\theta ^* \rightarrow \theta ^{*'}\). The GNN and TR models are then evaluated on query sets of new molecules with the remaining \(n\) samples for this test task, predicting the binding activity of compounds for 1 specific NR with just a few labeled compounds. In this meta-learning framework, the NR data is divided into a set of meta-training and meta-testing tasks for different NRs. During the meta-training phase, a set of meta-training tasks for 10 different NRs is performed. For each task, a random support set of size \((k+, k-)\) (\(k+\) positive and \(k-\) negative samples) is sampled for training, and a disjoint query set is sampled for evaluation. More specifically, we compute the gradient of the loss with respect to the Meta-GTNRP parameters using just a few examples from that task and update the model parameters such that the model performs well on the query set of this task. The updated parameters obtained in meta-training are then used to initialize Meta-GTNRP to predict a query set of new compounds for a new NR meta-testing task with limited data. This meta-learning framework is illustrated in Fig. 2, with the algorithm shown below.
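A first-order sketch of the inner-loop adaptation in Eqs. (11)-(12) is shown below; it adapts a task-specific copy of a generic model with SGD on the support set and returns the query loss, whereas full MAML [31] also differentiates through these inner steps for the meta-update.

```python
# First-order sketch of Eqs. (11)-(12); a simplification, not the full
# Meta-GTNRP loop (full MAML backpropagates through the inner updates).
import copy
import torch

def adapt_and_evaluate(model, loss_fn, support, query, alpha, steps=5):
    task_model = copy.deepcopy(model)                 # theta -> theta_t
    opt = torch.optim.SGD(task_model.parameters(), lr=alpha)
    x_s, y_s = support
    for _ in range(steps):                            # a few gradient steps
        opt.zero_grad()
        loss_fn(task_model(x_s), y_s).backward()      # support loss L_t
        opt.step()
    x_q, y_q = query
    return loss_fn(task_model(x_q), y_q)              # query loss L'_t
```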

Fig. 2 Schematic of the GNN-Transformer meta-learning framework for NR-binding activity prediction. The framework is composed of two neural networks: a GNN \(f\) and a Transformer (TR) \(g\) with parameters \(\theta\) and \(\theta ^*\), respectively. In meta-training, few-shot tasks \(t\) include random support sets \(S_t\) with positive samples \(k_{+}\) and negative samples \(k_{-}\) provided for training. The remaining \(n\) data points are then used as query sets \(Q_t\) for evaluation. The GNN and TR models, \(f\) and \(g\), consider 10 few-shot tasks for 10 different NRs in meta-training. Then, in meta-testing, the updated parameters are used to initialize the Meta-GTNRP model to predict the NR-binding activity of new compounds in the query set of 1 new NR task, using random support sets of size \((k_{+}, k_{-})\) for \(k\)-shot experiments

Algorithm 1 Meta-GTNRP: Few-shot GNN-Transformer Meta-Learning Framework

Loss function for NR-binding activity prediction

The loss function for the GNN and Transformer models, \(\mathcal {L}^{GNN}\) and \(\mathcal {L}^{TR}\), is a binary cross-entropy loss. However, to address the issue of class imbalance in NR-binding activity prediction, a weighted loss penalizes misclassifications of rare-class instances more heavily. Hence, the cross-entropy loss includes a weight \(w\) for the minority class and is formalized by

$$\begin{aligned} \mathcal {L} = - \frac{1}{n} \sum _{i=1}^{n} \left[ w \, y_i \log (y_i') + (1-y_i) \log (1-y_i') \right] \end{aligned}$$
(13)

where \(y'\) are the predictions, \(y\) the binding activity labels and \(n\) the number of data points. Since we observe different positive-negative ratios across individual NR tasks, the value of \(w\) is selected by exploring different values within an appropriate range. We establish a value of \(w = 5\) to accommodate this task-specific variability among NR-specific predictive tasks.
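Eq. (13) can be read directly into PyTorch as below; equivalently, torch.nn.BCEWithLogitsLoss with the pos_weight argument applies the same minority-class weighting to raw logits.

```python
# Direct reading of Eq. (13): binary cross-entropy with weight w on the
# minority (positive) class; y_pred holds sigmoid outputs in (0, 1).
import torch

def weighted_bce(y_pred, y_true, w=5.0, eps=1e-7):
    y_pred = y_pred.clamp(eps, 1.0 - eps)  # avoid log(0)
    loss = w * y_true * torch.log(y_pred) + (1 - y_true) * torch.log(1 - y_pred)
    return -loss.mean()
```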

Baseline methods

The proposed few-shot GNN-Transformer model, Meta-GTNRP, is compared with three other types of GNNs:

1. GIN - a top-performing GNN method that applies the Weisfeiler-Lehman (WL) isomorphism test to aggregate important parts of the node's neighborhood [35];

2. GCN - updates node embeddings using convolutions for neighborhood aggregation, similar to the convolutional filters found in convolutional networks [33];

3. GraphSAGE - uses an inductive framework based on a node-centric training method to update node features within unseen graph representations [34].

These GNN baselines are obtained by removing the Transformer component of Meta-GTNRP and are also trained and evaluated with a few-shot meta-learning approach to ensure a comparable evaluation across few-shot tasks in the 5-shot and 10-shot settings. The GNN models likewise use the pre-trained models of Hu et al. [36] for improved initialization and are optimized using a standard binary cross-entropy loss function.

The graph-based baselines and Meta-GTNRP are implemented in Python 3.9.16 and PyTorch 1.13.0 with CUDA 11.6, along with functions from Scikit-learn 1.2.2, Numpy 1.22.3, Pandas 1.5.3 and RDKit 2022.03.5. The best GNN and Meta-GTNRP models are selected at the epoch giving the best ROC-AUC score on the query set of the final meta-testing task, with training allowed to run for at most 1000 epochs. We consider 5 update steps for meta-training and 10 for meta-testing. In addition, the GNN models use a total of 5 message-passing layers and a graph embedding dimension of 300.

The GNN baseline models are trained and evaluated using a few-shot meta-learning approach [31] in the 5-shot and 10-shot settings. As with the Meta-GTNRP model, for NR-binding activity prediction tasks we consider 10 meta-training tasks for 10 different NRs and 1 final meta-testing task for the remaining NR. For NR-agonist and NR-antagonist activity prediction tasks, we also build single-task (ST) models with 1 meta-training task and 1 final meta-testing task. This experimental setup ensures a fair comparison between Meta-GTNRP and the GNN baselines.

Nuclear receptor binding activity experiments

In this study, we evaluate the binary classification of compounds across few-shot tasks for NR-binding activity prediction. For a total of 11 NRs, the proposed Meta-GTNRP model considers 10 meta-training tasks for 10 different NRs and is evaluated on 1 final meta-testing task for a specific NR with limited available data (see Fig. 2). Specifically, we calculate six performance metrics on the query set of the test task for each of the 11 NRs: sensitivity (Sn), specificity (Sp), precision (Pr), accuracy (Acc), ROC-AUC score and F1 score (F1s). ROC-AUC is the main performance metric; it computes the area under the receiver operating characteristic curve to evaluate performance in the imbalanced scenarios of NR-binding activity prediction. Here, we conduct 5-shot and 10-shot experiments with random support sets of size \((5+,5-)\) and \((10+,10-)\) for each individual NR, respectively. Experiments are repeated 30 times, using random support sets each time, to obtain a robust estimate of performance for each metric. In Table 3, we report the means and standard deviations of the ROC-AUC results obtained by the Meta-GTNRP model, considering 10 meta-training tasks for 10 different NRs across 30 experiments with \((5+,5-)\) (5-shot) and \((10+,10-)\) (10-shot) random support sets, evaluated on 1 meta-testing task for 1 specific NR. In bold, we show the best ROC-AUC results for each NR-specific test task. The \({ \triangle }\) column shows the difference in performance between the proposed model and the graph-based baselines. In the Supplementary Material, figures show scatter plots overlaid with boxplots of the ROC-AUC scores and standard deviations obtained across 30 experiments with 5-shot and 10-shot random support sets for NR-binding activity prediction tasks on 11 different NRs. The full performance results for each metric are reported in tables also provided in the Supplementary Material.
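The repeated-sampling protocol can be sketched as follows; adapt_and_predict is a hypothetical placeholder for the adapt-then-score step of any of the models, and the balanced \((k+, k-)\) support sampling mirrors the setup above.

```python
# Sketch of the 30-repeat k-shot evaluation protocol; adapt_and_predict is a
# placeholder that adapts on the support indices and scores the query indices.
import numpy as np
from sklearn.metrics import roc_auc_score

def repeated_eval(adapt_and_predict, y, k=5, repeats=30, seed=0):
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    scores = []
    for _ in range(repeats):
        support = np.concatenate([rng.choice(pos, k, replace=False),
                                  rng.choice(neg, k, replace=False)])
        query = np.setdiff1d(np.arange(len(y)), support)  # disjoint query set
        scores.append(roc_auc_score(y[query], adapt_and_predict(support, query)))
    return float(np.mean(scores)), float(np.std(scores))   # mean +/- std ROC-AUC
```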

Table 3 Average ROC-AUC scores obtained across 30 experiments with random support sets of size \((5+, 5-)\) (5-shot) and \((10+, 10-)\) (10-shot) by Meta-GTNRP and few-shot GNN baselines in NR-binding activity prediction for 11 different NRs

Transfer-learning agonist and antagonist activity experiments

In our experiments, we also test the ability of the Meta-GTNRP model to transfer the knowledge of a single NR-binding task for one individual NR to predict the agonist or antagonist activity of compounds on that specific NR (see Fig. 3). The main goal is to evaluate the generalization power of the Meta-GTNRP model by means of single-task (ST) models that use the NR-binding information to determine which compounds have agonist and antagonist activity for individual NRs. Here, we conduct few-shot experiments to transfer the learning of ST models trained on NR-binding activity information to predict the NR-specific agonist and antagonist activity with limited data.

Fig. 3 Schematic of the meta-learning framework for NR-agonist and NR-antagonist activity prediction. In this experiment, we consider 2 few-shot tasks: 1 meta-training task with NR-binding information and 1 meta-testing task for NR-agonist or NR-antagonist activity prediction. In our proposed framework, both the GNN and Transformer models, \(f\) and \(g\), consider an NR-binding task in meta-training to build single-task (ST) models evaluated on the corresponding NR-specific agonist or antagonist test task. Model performance is assessed on the query set of a new NR-agonist or NR-antagonist meta-testing task using a random support set of size \((k_{+}, k_{-})\) for \(k\)-shot experiments

In this case, we conduct 5-shot and 10-shot experiments for ST models trained on a single NR-binding activity task to predict the agonist or antagonist activity for that specific NR on a single NR-agonist or NR-antagonist test task. These experiments are repeated 30 times, using random support sets each time. In Tables 4 and 5, we report the means and standard deviations of the ROC-AUC results for the agonist and antagonist activity predictions made by the ST models across 30 experiments with \((5+,5-)\) (5-shot) and \((10+,10-)\) (10-shot) random support sets, considering 1 NR-specific binding task in meta-training and evaluating on 1 agonist or antagonist meta-testing task for that specific NR, respectively. In bold, we present the best ROC-AUC results for each individual NR-specific test task. The \({ \triangle }\) column shows the difference in performance between the proposed model and the graph-based baselines. In the Supplementary Material, figures show scatter plots overlaid with boxplots of the ROC-AUC scores and standard deviations obtained across 30 experiments with 5-shot and 10-shot random support sets in agonist and antagonist activity prediction on 11 different NRs. The full performance results for the different metrics are reported in tables also provided in the Supplementary Material.

Table 4 Average ROC-AUC scores obtained across 30 experiments with random support sets of size \((5+, 5-)\) (5-shot) and \((10+, 10-)\) (10-shot) by the single-task (ST) Meta-GTNRP models and single-task (ST) GNN baselines considering 1 NR binding task in meta-training and 1 NR agonist task in meta-testing
Table 5 Average ROC-AUC scores obtained across 30 experiments with random support sets of size \((5+, 5-)\) (5-shot) and \((10+, 10-)\) (10-shot) by the single-task (ST) Meta-GTNRP models and single-task (ST) GNN baselines considering 1 NR binding task in meta-training and 1 NR antagonist task in meta-testing

Statistical significance analysis of performance results

In this work, ROC-AUC (Receiver Operating Characteristic-Area Under the Curve) scores evaluate the performance of the Meta-GTNRP model and the graph-based baselines. The ROC-AUC score is a widely used metric that measures the ability to distinguish between positive and negative samples and is particularly useful for measuring performance on limited, imbalanced data. However, reporting ROC-AUC scores alone may not be sufficient to draw conclusions about the significance of the performance differences between models.

To address this issue, we performed a statistical significance analysis to determine whether the differences in ROC-AUC scores between the Meta-GTNRP model and the GNN baselines are statistically significant or merely due to chance. This analysis compared the ROC-AUC results using a statistical significance test. The calculated \(p\)-values indicate the probability of observing such a difference by chance alone, and a threshold level of significance (\(\alpha =0.05\)) is used to determine whether the difference is statistically significant. The first step was to run a normality test to determine whether the ROC-AUC scores in each pair of performance results are normally distributed. We used the normality test provided by the SciPy library [49], based on the D'Agostino-Pearson omnibus test [50, 51], which combines skewness and kurtosis measurements to test the null hypothesis that the data is normally distributed. Hence, if the \(p\)-value is less than the significance level (\(p < 0.05\)), we reject the null hypothesis and conclude that the data is not normally distributed. Otherwise, we fail to reject the null hypothesis, indicating that the ROC-AUC results are likely to follow a normal distribution.

Next, we assessed the descriptive statistics of the normality test to evaluate whether the variances of the two distributions were the same, and found a difference in variance for all model results. To test the statistical significance of each pair of results, we used a modified version of the Student t-test when both distributions are likely to be normal. This version of the t-test, known as Welch's t-test, is appropriate when the two distributions being compared have unequal variances. When the distributions are unlikely to follow a normal distribution, we apply the non-parametric Mann-Whitney U test. In the statistical significance test, \(p\)-values are calculated considering the hypotheses:

  • H0: Performance results are likely drawn from the same distribution;

  • H1: Performance results are likely drawn from different distributions (reject H0).

The calculated \(p\)-values are used to assess the statistical significance of the mean differences between the two distributions, considering a significance level of \(\alpha =0.05\). If \(p < 0.05\), we reject the null hypothesis (H0) and conclude that there is evidence to support the alternative hypothesis (H1), indicating that the observed result is statistically significant. In Table 6, we report the significance analysis of the ROC-AUC scores obtained across 30 experiments by Meta-GTNRP with respect to the GNN baselines in the 5-shot and 10-shot settings for the NR-binding, NR-agonist and NR-antagonist activity prediction tasks. The results show that all \(p\)-values are lower than the significance level, leading us to reject the null hypothesis and conclude that the ROC-AUC results of Meta-GTNRP are statistically significant when compared with the GNN baselines.
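The test-selection logic described above can be sketched with SciPy as follows, using the D'Agostino-Pearson normality test, Welch's t-test and the Mann-Whitney U test.

```python
# Sketch of the significance-testing procedure with SciPy; scores_a/scores_b
# are per-run ROC-AUC arrays (e.g., 30 values each) for two models.
from scipy import stats

def compare_scores(scores_a, scores_b, alpha=0.05):
    # D'Agostino-Pearson omnibus test: H0 = data is normally distributed.
    normal = (stats.normaltest(scores_a).pvalue >= alpha
              and stats.normaltest(scores_b).pvalue >= alpha)
    if normal:
        # Welch's t-test: Student t-test variant for unequal variances.
        res = stats.ttest_ind(scores_a, scores_b, equal_var=False)
        return "welch-t", res.pvalue
    # Non-parametric fallback when normality is rejected.
    res = stats.mannwhitneyu(scores_a, scores_b, alternative="two-sided")
    return "mann-whitney-u", res.pvalue
```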

This analysis enabled us to determine the significance of the performance differences between Meta-GTNRP and GNN baselines, allowing us to draw important conclusions about the effectiveness of the Meta-GTNRP model in NR-activity prediction tasks. In addition, the statistical significance analysis provided valuable insights for comparing the performance of few-shot models, offering robust evidence to support the validity of our findings.

Table 6 \(p\)-value results of the statistical significance test for 5-shot and 10-shot experiments in NR-binding (BIN), NR-agonist (AGO) and NR-antagonist (ANT) activity prediction tasks

Discussion

In this work, we introduce a few-shot GNN-Transformer, Meta-GTNRP, to predict NR-binding properties across 11 different NR tasks using limited available data. The proposed few-shot two-module meta-learning framework combines the information of 10 NR-specific meta-training tasks to predict the NR-binding activity of compounds for 1 new NR in a final meta-testing task. We demonstrate that Meta-GTNRP achieves superior performance in NR-binding activity prediction over standard GNN methods.

In Table 3, we report the average ROC-AUC scores obtained across 30 experiments with \((5+,5-)\) (5-shot) and \((10+,10-)\) (10-shot) random support sets for models considering 10 NR tasks in meta-training to predict the binding activity of compounds for 1 remaining NR meta-testing task. The results show that the proposed Meta-GTNRP model outperforms the GNN baseline methods (GIN, GCN, GraphSAGE) for the majority of NR test tasks in the 5-shot and 10-shot settings. The proposed approach achieves superior performance given the class imbalance across tasks and the lack of labeled information for each individual NR. In general, the Meta-GTNRP model shows the best ROC-AUC results along with high Sn and Sp values, correctly predicting a high proportion of both active and non-active binders. The standard deviations reported also indicate a smaller variance, which ensures a stable performance and more robust results across the 5-shot and 10-shot experiments (see the figures in the Supplementary Material). Conversely, GIN, GCN and GraphSAGE produce unstable performances and high-variance predictions across NR test tasks, which means they may generalize well in some cases, but collapse in the majority of NR-binding experiments.

In our experiments, we also test the ability of the Meta-GTNRP model, given a single NR-binding task in meta-training, to predict the agonist and antagonist activity for a specific NR. The goal is to evaluate the performance of single-task (ST) models for each individual NR and compare their NR-agonist and NR-antagonist activity predictions with the standard GNN baselines. There are relevant differences in the performance yielded by the ST models for each NR on the NR-agonist and NR-antagonist predictive tasks. Nonetheless, in the 5-shot and 10-shot experiments, the ROC-AUC scores in Table 4 show that the Meta-GTNRP model achieves higher and more robust results in NR-agonist activity prediction for most NR test tasks when compared with the graph-based baselines. Similarly, Meta-GTNRP also performs better in NR-antagonist activity prediction for most NR tasks across the 5-shot and 10-shot experiments, as shown in Table 5. It is important to note that individual NRs with higher class imbalance or more limited data are likely to show reduced predictive performance.

This study addresses key challenges related to class imbalance and limited data when predicting the activity of compounds for specific NRs. Class imbalance, where certain classes are underrepresented, biases predictive models toward the majority class, leading to misclassification and high-variance predictions in the minority class. This is particularly problematic in drug discovery. Limited data aggravates this issue, providing insufficient samples for effective learning, which results in poor generalization, especially for NRs with just a few labeled compounds.

When quantifying the class imbalance across different NRs, we found significant disparities. For instance, PPARA shows extreme class imbalance, with only 15 negative samples compared to 1352 positive samples for binding activity prediction tasks, and 14 negative samples compared to 1084 positive samples for agonist activity prediction tasks. Similarly, RXR is heavily imbalanced in agonist activity prediction tasks, with only 263 positive and 4549 negative samples. This imbalance results in lower sensitivity and higher false-negative rates for these NRs, causing predictive models to underperform on the minority class.

Moreover, the limited data available for specific NRs, such as PXR and RXR, also poses a significant challenge for predictive models. PXR, for instance, has only 1327 positive samples and 3866 negative samples for binding activity prediction tasks, leading to potential overfitting and reduced predictive performance. Similarly, RXR also suffers from limited data, with only 1006 positive and 4569 negative samples for binding activity prediction tasks, which can hinder the ability of predictive models to generalize effectively.

Meta-GTNRP mitigates the challenges of low data and class imbalance by adopting a few-shot meta-learning approach based on model-agnostic meta-learning (MAML) [31]. In meta-training, this strategy learns across few-shot (\(k\)-shot) tasks for different NRs using random support sets of \((k+,k-)\) samples for training and a disjoint query set with the remaining samples for evaluation. By constructing support sets in a balanced manner, with the same number of positive \((k+)\) and negative \((k-)\) samples so that each class is equally represented, Meta-GTNRP learns to handle imbalanced data more effectively. In addition, support sets include only a small number \(k\) of representative samples for the model to learn from, making the approach well-suited for situations with limited available data.

Therefore, Meta-GTNRP applies a few-shot meta-learning approach to address the challenges posed by class imbalance and limited data in NR activity prediction. This meta-learning strategy facilitates the transfer of learned knowledge across NR-specific few-shot tasks in meta-training, and the updated parameters are used to initialize Meta-GTNRP and generalize to new compounds for new NR tasks in meta-testing. As a result, Meta-GTNRP maintains strong predictive performance even when the available data is limited and highly imbalanced.

The performance of Meta-GTNRP on PPARA is notably distinct due to several challenges: the extreme class imbalance and limited data for PPARA hinder the ability to learn minority-class dependencies. In addition, the complexity introduced by the Transformer component exacerbates performance issues on such small and imbalanced data. PPARA's unique ligand-binding domain, which interacts with diverse ligands, adds complexity to this prediction task, as the model struggles to capture this diversity with limited data. To improve performance on PPARA, strategies such as data augmentation, a PPARA-specific weighted loss function, or reducing model complexity could be considered. However, these adaptations may negatively impact performance on other NR tasks, as the model might not generalize well to data with different distributions. Thus, it is essential to balance optimizing for PPARA with maintaining performance across all NRs.

The complexity of Meta-GTNRP also poses challenges for scalability, particularly with larger datasets and more diverse NR activity prediction tasks. Future research will focus on optimizing Meta-GTNRP for scalability by exploring techniques to improve efficiency and better manage computational resources. This will potentially make the model more adaptable to new NR-based drug discovery applications. Another potential limitation is the sensitivity of Meta-GTNRP to hyperparameter settings, requiring additional fine-tuning for broader generalization to new NRs. Consequently, employing alternative hyperparameter tuning methods and conducting sensitivity analyses will be crucial for achieving robust and generalizable performance across new and diverse NR activity prediction tasks.

Despite these challenges, Meta-GTNRP has the potential to transfer the learned knowledge among multiple NRs to help in the discovery of compounds that target other NRs involved in different biological processes. Meta-GTNRP can help researchers to accelerate drug discovery, making it more efficient to identify NR-modulators with limited data, which is crucial for developing therapies for multiple diseases. This makes Meta-GTNRP a useful tool in the field of computational drug discovery, offering new opportunities for the identification of NR-based drug candidates.

t-SNE visualization experiments in NR-binding activity prediction

To better show the effectiveness of our proposed approach in NR-binding activity prediction over the graph-based baselines, we visualize the token embeddings \(h_T\) computed by Meta-GTNRP and the graph embeddings \(h_G\) obtained by the GNN baselines for each of the 11 NR-binding tasks in the 5-shot experiments. We computed the t-distributed stochastic neighbor embeddings (t-SNE) [52] implemented in Scikit-learn with the following parameters: n_components = 2, perplexity = 50 and learning rate = 300 for Meta-GTNRP and the standard GNN methods. The t-SNE cluster plots are displayed in Fig. 4 (5-shot) for the 11 different NRs, where red dots denote positive samples and blue dots denote negative samples.
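The projection uses the standard Scikit-learn t-SNE API with the parameters listed above; the embedding array below is a random stand-in for the actual \(h_T\) or \(h_G\) matrices.

```python
# t-SNE projection with the parameters reported above; "embeddings" is a
# random stand-in for the [n_samples, 300] token or graph embedding matrix.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 300)
labels = np.random.randint(0, 2, size=200)    # 1 = active, 0 = non-active

coords = TSNE(n_components=2, perplexity=50,
              learning_rate=300, random_state=0).fit_transform(embeddings)
for cls, color in [(1, "red"), (0, "blue")]:
    mask = labels == cls
    plt.scatter(coords[mask, 0], coords[mask, 1], c=color, s=8)
plt.show()
```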

In Fig. 4, the positive and negative compounds predicted by the baselines GIN, GCN and GraphSAGE are mixed together, indicating a limited ability to distinguish between active and non-active binders for the different NRs. Conversely, the Meta-GTNRP model obtains well-defined clusters of non-active binders progressively separating from active binders in the low-dimensional feature space for most NR-binding activity tasks. In addition, Meta-GTNRP shows clusters of negative data points (blue dots) close to each other and well-separated from the positive data points (red dots), with some overlap reflecting the global connectivity among active and non-active binders. Hence, for most NR-binding tasks, Meta-GTNRP clearly outperforms the GNN baselines in discriminating positive and negative samples for NR-binding activity prediction.

Fig. 4 t-SNE visualizations of token embeddings \(h_T\) generated by Meta-GTNRP and graph embeddings \(h_G\) obtained by GNN baselines for 5-shot experiments. In this figure, blue dots denote negative samples and red dots represent positive samples for each NR-binding activity task

Analysis of structural alerts in NR-binding activity prediction

Structural alerts (SA) are molecular substructures that help to identify key molecular fragments and functional groups with an important role in NR-binding activity. These structural fragments are often used to indicate biological activity, but can also illustrate a possible mode of action for a given compound. The combination of structural alerts and predictive models can therefore offer a robust, more interpretable way to understand a prediction. Here, we analyse the structural alerts to identify the key substructures responsible for NR-binding activity. In this experiment, we obtain different types of molecular substructures identified from the predictions of Meta-GTNRP for each specific NR, considering the 10 other NR tasks in meta-training. The significant molecular substructures are identified using Bioalerts [53], a Python package for deriving SAs from bioactivity data. The probability of a substructure being a structural alert is given by the probability density function of the binomial distribution, and \(p\)-values are calculated to assess the statistical significance using the predictions of Meta-GTNRP for each NR. The threshold frequency is set to 0.70, and the other parameters are set to their defaults \((p\text{-}value \le 0.05, nb \ge 50)\). In Fig. 5, we show the main substructures, obtained from the predictions of Meta-GTNRP for each specific NR, found significant in the 5-shot NR-binding activity experiments. The structural features derived from this workflow are summarized below for each NR studied.
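As a rough illustration of the enrichment idea behind this derivation (not the Bioalerts API), the following RDKit/SciPy sketch flags a substructure as a candidate alert when its frequency among predicted actives passes the 0.70 threshold with a significant binomial \(p\)-value; the SMARTS pattern is a hypothetical carbonyl example, not an alert reported here.

```python
# Illustrative enrichment test, not the Bioalerts implementation; the SMARTS
# pattern below is a hypothetical carbonyl example, not a reported alert.
from rdkit import Chem
from scipy.stats import binomtest

def is_alert(smiles_list, predicted_active, smarts="[#6]=O",
             freq_thr=0.70, alpha=0.05):
    patt = Chem.MolFromSmarts(smarts)
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    has = [m is not None and m.HasSubstructMatch(patt) for m in mols]
    n_active = sum(predicted_active)
    k = sum(h and a for h, a in zip(has, predicted_active))
    base_rate = sum(has) / len(has)        # background substructure frequency
    freq = k / max(n_active, 1)            # frequency among predicted actives
    p = binomtest(k, n_active, base_rate, alternative="greater").pvalue
    return freq >= freq_thr and p < alpha
```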

Fig. 5 Analysis of structural alerts (SA) and representative molecular structures in NR-binding activity experiments. Each subfigure shows the significant substructures that determine NR-binding activity for each specific NR, obtained using the predictions of the Meta-GTNRP model for 5-shot experiments

The structural alerts (SAs) identified for the various nuclear receptors (NRs) reveal both shared and unique molecular features that influence ligand binding and NR activation. For PR, GR, and AR, several key structural elements are shared, reflecting their steroid-based nature. In PR, the 3-keto groups of most pregnane-based ligands are crucial for ligand-receptor binding. These groups establish vital hydrogen bonds (H-bonds) with amino acid residues in the PR ligand-binding domain (LBD), making them essential for PR activity; removal or replacement of the 3-keto group significantly reduces the NR-binding activity [54,55,56,57]. Similarly, in GR, ketone groups play a central role, forming key H-bonds with the arginine and glutamine residues in the LBD, which enhances NR affinity. The structural alerts for GR ligands also emphasize the importance of backbone ring structures with oxygen or nitrogen groups that contribute to more stable ligand-receptor interactions [58]. For AR, the 3-keto and OH groups are critical for androgenic activity, facilitating interactions with key amino acid residues such as the T877 side chain in the AR LBD [59]. Additionally, nonsteroidal ligands such as quinolone, hydantoin, and bicalutamide derivatives can also bind effectively to the AR, allowing for flexible structural modifications used to develop potential drug candidates for the treatment of androgen-sensitive prostate cancer [60, 61].

PXR and RXR ligands exhibit distinct features. PXR ligands are characterized by scaffold ring structures and oxygen functional groups, particularly ketones, which are essential for ligand-receptor interactions via H-bonding within the PXR ligand-binding pocket [58, 62]. In contrast, RXR ligands are generally lipophilic and include functional groups like double-bond oxygens or carboxylic acids that interact with arginine and serine residues in the RXR LBD [63]. These ligands typically contain aromatic or aliphatic ring structures, such as the cyclohexene group in retinoic acid, which aligns with the structural alerts identified for this specific NR [58].

ER ligands share similarities with those of RXR. Both ERA and ERB ligands rely on ring structures and oxygen or nitrogen functional groups to establish hydrogen bonds within the ER LBD, which are crucial for an effective ligand-receptor interaction [64]. These shared structural patterns across different NR families highlight the conserved features required for the NR activation.

FXR, however, displays more diverse structural alerts. While hydrogen bonding with arginine and histidine residues via carboxylic groups is a common feature, the SAs for FXR also show functional groups like nitrogen, sulfur, and halogens connected to aromatic and aliphatic rings [58, 65]. This diversity suggests that FXR can accommodate a broader range of chemical structures compared to other NRs.

Lastly, the PPARs represent a distinct class of NRs. The three PPAR isoforms (PPARA, PPARD, and PPARG) prefer diaromatic scaffolds with specific functional groups tailored to the activity of each isoform. Unlike the steroidal NRs, PPAR ligands often lack a steroid backbone but still maintain the structural motifs necessary for NR activation. For example, PPARG agonists are used to manage insulin resistance, while PPARA and PPARD primarily regulate glucose metabolism. In addition, fatty acid- and retinoid-like ligands with moderate PPAR affinity are also observed among the identified SAs [58, 66].

The structural alerts and their representative structures obtained for different NRs identified key molecular fragments and functional groups playing significant roles in NR-binding activity. These substructures are critical for understanding the biological activity of compounds and their potential modulator properties for specific NRs. The most significant molecular substructures reveal both similarities and differences in the structural alerts across various NRs, highlighting distinct and common features that influence the NR-binding activity.

It is important to note that the frequency of ligand structures appearing across NRs can be attributed to several factors inherent to the original data and the biological nature of the NRs. Certain substructures, such as the 3-keto groups in PR, GR, and AR, appear frequently due to their critical role in establishing stable interactions through H-bonding within the NR ligand-binding domains. Conversely, the diversity of structural alerts for FXR suggests that its ligand-binding domain can accommodate a broader range of chemical structures. This variability of structural alerts is indicative of potential selective NR modulation and highlights the adaptability of these NRs to different chemical environments.

Conclusion

Nuclear receptors (NRs) are important biological targets whose activity is modulated by the binding of drug-like compounds. In this work, the goal is to take into account the individual contribution of different NRs and leverage their complementarity to predict the NR-binding properties of compounds with high sensitivity and high specificity under imbalanced and limited data, which is crucial in drug discovery.

In this paper, we propose a few-shot GNN-Transformer, Meta-GTNRP, that captures the local information of molecular graphs and preserves the global structure of graph embeddings using a two-module meta-learning framework for NR-binding activity prediction with limited data. This few-shot learning strategy combines the information of 11 individual predictive tasks for 11 different NRs in a joint learning procedure to predict the binding, agonist and antagonist activity with just a few labeled compounds in highly imbalanced scenarios. The results provide strong evidence that meta-learning is a data-efficient approach to model the NR-binding activity of compounds across few-shot tasks when limited data is available. The ROC-AUC results show that Meta-GTNRP generalizes well to new NR tasks with smaller variance, achieving superior performance over standard graph-based methods. Hence, the proposed Meta-GTNRP framework is an effective method to predict the NR-binding properties of compounds through an optimized meta-learning procedure, delivering faster and more robust results with just a few labeled compounds. This approach can be used to identify potential NR-based drug candidates with limited available data, making Meta-GTNRP a valuable tool to accelerate the process of drug discovery and development.