Introduction

Multidrug resistance protein 1 (MDR1) and breast cancer resistance protein (BCRP) contribute substantially to drug absorption and distribution (1). In particular, these efflux transporters are expressed at the blood–brain barrier (BBB) and prevent the brain penetration of drugs (2,3,4,5,6). Therefore, in vitro screening to eliminate MDR1 and BCRP substrates is used in developing medicines for central nervous system (CNS) diseases.

Efflux activity prediction can help reduce the cost and time of drug discovery. MDR1 has a large binding pocket and recognizes structurally diverse compounds (7); therefore, accurate prediction of substrates via molecular docking simulations using protein structural information remains challenging. Quantitative structure–activity relationship (QSAR) modeling using machine learning, a type of artificial intelligence (AI), has been used to improve absorption, distribution, metabolism, and excretion (ADME) properties in the drug discovery process (8,9,10,11,12,13,14,15). Several computational classification models based on various techniques have been reported to predict substrates and modulators of MDR1, BCRP, and other ATP-binding cassette (ABC) transporters (16,17,18,19,20,21,22,23,24,25,26). However, few studies have presented regression models for predicting the efflux activity of MDR1 and BCRP (12, 27, 28).

Exploring various machine learning algorithms is essential for finding the best prediction model, and automated machine learning frameworks comprehensively investigate multiple algorithms while minimizing technical issues. This study used PyCaret, an open-source machine learning library in Python that automates the machine learning workflow with minimal coding, as the descriptor-based machine learning approach (29). In a recent study, a model for predicting the fraction of a drug unbound in plasma built with PyCaret outperformed those built with other automated frameworks (30). In contrast to descriptor-based approaches, graph convolutional networks (GCNs) based on molecular graphs and convolutional neural networks (CNNs) based on compound images have recently gained attention in machine learning (31,32,33).

The GCN for molecular properties is a powerful approach for implementing multi-task and multimodal learning. Multi-task learning attempts to learn multiple different tasks simultaneously and has been utilized for predicting ADME parameters (13, 33,34,35). Additionally, GCN can combine different types of information, such as the chemical structures of molecules and amino acid sequences of proteins, through multimodal learning. A recent study applied multimodal GCN to the classification of molecular properties (36). However, the effectiveness of multimodal GCN in predicting ADME properties, including transporter activity, remains unclear.

This study developed a predictive model for MDR1 and BCRP activities using descriptor-based machine learning and GCNs. The predictive performances of the GCN models, including multi-task and multimodal learning, were compared with those of multiple descriptor-based machine learning approaches. In addition, an ensemble of descriptor-based machine learning and GCNs was used to enhance predictivity.

Materials and Methods

Materials

The test compounds were prepared by Takeda Pharmaceutical Company (Fujisawa, Japan) to determine efflux activity in MDR1- and BCRP-expressing cells. All other reagents and solvents were of analytical grade or better and were commercially available.

In Vitro Permeability in MDR1- and BCRP-Expressing Cells

The efflux ratio (ER) was determined using previously described methods (12). Test compounds solubilized in dimethyl sulfoxide (DMSO) were added to transport buffer (Hanks’ balanced salt solution with 10 mM HEPES, pH 7.4) at a final concentration of 2 μM (DMSO < 1%) on either the apical or basolateral side of transwell chambers containing Madin–Darby canine kidney (MDCK)-MDR1 cells (NIH) or MDCK-BCRP cells (Solvo Biotechnology, Szeged, Hungary).

The confluent cell monolayers on the transwell were incubated for 1 h at 37°C with 5% CO2. The test compounds were quantified by liquid chromatography–tandem mass spectrometry (LC-MS/MS; Applied Biosystems, Foster City, CA, USA). The ER and the permeation of the test compounds in the apical-to-basolateral (A to B) and basolateral-to-apical (B to A) directions were determined. The apparent permeability coefficient Papp (cm/s) was calculated using the following equation:

$${\mathrm{P}}_{\mathrm{app}}=\frac{{\mathrm{dC}}_{\mathrm{r}}}{\mathrm{dt}}\times \frac{{\mathrm{V}}_{\mathrm{r}}}{\mathrm{A}\times {\mathrm{C}}_{0}}$$
(1)

where dCr/dt is the slope of the cumulative concentration of the compound in the receiver chamber versus time (µM/s); Vr is the volume of the solution in the receiver chamber (0.075 mL on the apical side, 0.25 mL on the basolateral side); A is the surface area of the monolayer (0.0804 cm2); and C0 is the initial concentration in the donor chamber (µM).

The ER was calculated using the following equation:

$$\mathrm{ER}= \frac{{\mathrm{P}}_{\mathrm{app},\mathrm{ B to A}}}{{\mathrm{P}}_{\mathrm{app},\mathrm{ A to B}}}$$
(2)
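As a minimal sketch of Eqs. 1 and 2 (function and variable names are illustrative, not from the study’s code), the two quantities can be computed as follows, with dCr/dt in µM/s, Vr in mL (equal to cm³), A in cm², and C0 in µM:

```python
def papp(dcr_dt_um_per_s: float, v_receiver_ml: float, c0_um: float,
         area_cm2: float = 0.0804) -> float:
    """Apparent permeability coefficient Papp (cm/s), Eq. 1."""
    return dcr_dt_um_per_s * v_receiver_ml / (area_cm2 * c0_um)  # 1 mL = 1 cm^3


def efflux_ratio(papp_b_to_a: float, papp_a_to_b: float) -> float:
    """Efflux ratio ER, Eq. 2."""
    return papp_b_to_a / papp_a_to_b
```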

Data Preparation

The ERs of 9490 and 3440 compounds determined in MDR1- and BCRP-expressing cells, respectively, constituted the proprietary internal dataset. The data were split chronologically (time split): the training set comprised compounds used until May 2022, and the test set comprised compounds used from June 2022 to March 2023. As an external dataset, the ERs of 46 and 47 compounds were determined in MDR1- and BCRP-expressing cells, respectively.
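A minimal sketch of this chronological split is shown below, assuming the data are held in a pandas DataFrame; the file and column names are hypothetical, and a log10 transform of the ER is assumed (see Dataset Analysis):

```python
import numpy as np
import pandas as pd

# Hypothetical file and column names; the in-house data pipeline is not public
df = pd.read_csv("mdr1_er_inhouse.csv", parse_dates=["assay_date"])
df["log_er"] = np.log10(df["efflux_ratio"])          # log transform (base 10 assumed)

# Time split: training = compounds up to May 2022, test = June 2022 to March 2023
train = df[df["assay_date"] <= "2022-05-31"]
test = df[(df["assay_date"] > "2022-05-31") & (df["assay_date"] <= "2023-03-31")]
```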

Descriptor-Based Approach

Various molecular descriptors were generated using alvaDesc (1.0.16) (Alvascience Srl, Lecco, Italy), which provides physicochemical properties such as lipophilicity, polarity, molar refractivity, and pharmacophore features. Three-dimensional descriptors and descriptors containing missing (N/A) values were removed, and the remaining 2D descriptors were used for feature selection.

The Boruta algorithm, a wrapper around the random forest algorithm that identifies essential features for further analysis, was applied for feature selection (37, 38). After feature selection, PyCaret 2.3.6 was used for data splitting, model selection, and hyperparameter tuning (29). Twenty-five machine learning algorithms, including linear regression (lr), lasso regression (lasso), ridge regression (ridge), elastic net (en), least angle regression (lar), lasso least angle regression (llar), orthogonal matching pursuit (omp), Bayesian ridge (br), automatic relevance determination (ard), passive aggressive regressor (par), random sample consensus (ransac), TheilSen regressor (tr), Huber regressor (huber), kernel ridge (kr), support vector machine (svm), K-nearest neighbors (knn), decision tree regressor (dt), random forest regressor (rf), extra trees regressor (et), AdaBoost regressor (ada), gradient boosting regressor (gbr), MLP regressor (mlp), extreme gradient boosting (xgboost), light gradient boosting machine (lightgbm), and CatBoost regressor (catboost), were employed for model building. For model selection, tenfold cross-validation was applied only to the training datasets. Model performance was assessed using the percentage within a twofold error, the coefficient of determination (R2), and root-mean-square error (RMSE). The percentage within a twofold error is the percentage of predicted values that fall within twofold of the observed values and is used to evaluate prediction acceptability. R2 and RMSE were calculated using the following equations:

$$R^2=1-\frac{\sum_{k=1}^{n}\left(\mathrm{Observed\ value}_k-\mathrm{Predicted\ value}_k\right)^2}{\sum_{k=1}^{n}\left(\mathrm{Observed\ value}_k-\mathrm{Mean\ value}\right)^2}$$
(3)
$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\mathrm{Predicted\ value}_k-\mathrm{Observed\ value}_k\right)^2}$$
(4)
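A minimal sketch of these evaluation metrics (Eqs. 3 and 4 plus the twofold-error criterion), assuming observations and predictions are on a log10(ER) scale:

```python
import numpy as np

def evaluate(observed_log_er: np.ndarray, predicted_log_er: np.ndarray) -> dict:
    residuals = observed_log_er - predicted_log_er
    r2 = 1.0 - np.sum(residuals ** 2) / np.sum((observed_log_er - observed_log_er.mean()) ** 2)
    rmse = np.sqrt(np.mean(residuals ** 2))
    # "Within twofold" on the linear ER scale corresponds to |log10 residual| <= log10(2)
    pct_within_twofold = 100.0 * np.mean(np.abs(residuals) <= np.log10(2.0))
    return {"R2": r2, "RMSE": rmse, "% within twofold": pct_within_twofold}
```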

Hyperparameter tuning was performed for the model showing the best performance in tenfold cross-validation. The number of iterations in the grid search for hyperparameter tuning was set to 100. Finally, the model performance was evaluated using test and external datasets.
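The descriptor-based workflow can be sketched as follows; this is a simplified illustration under assumed DataFrame and column names (descriptors_df, train_df, test_data), with PyCaret’s remaining setup options left at their defaults:

```python
import pandas as pd
from boruta import BorutaPy
from sklearn.ensemble import RandomForestRegressor
from pycaret.regression import setup, compare_models, tune_model, predict_model

# descriptors_df: 2D alvaDesc descriptors; train_df["log_er"]: log-transformed efflux ratios
rf = RandomForestRegressor(n_jobs=-1, max_depth=5)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(descriptors_df.values, train_df["log_er"].values)
selected = descriptors_df.columns[boruta.support_]            # features confirmed by Boruta

train_data = pd.concat([descriptors_df[selected], train_df[["log_er"]]], axis=1)
setup(data=train_data, target="log_er", fold=10, session_id=0)
best = compare_models()                                       # 10-fold CV over the candidate regressors
tuned = tune_model(best, n_iter=100)                          # random grid search, 100 iterations
test_predictions = predict_model(tuned, data=test_data)       # held-out time-split test set
```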

Graph-Based Approach

The training and test sets from the descriptor-based approach were used in the graph-based approach. We used a GCN comprising two graph convolutional layers and two linear transformation layers (39). The atom features were the atom type, degree, hybridization, aromaticity, formal charge, number of implicit hydrogens, number of radical electrons, and total number of hydrogens on each atom. The percentage within a twofold error, R2, and RMSE were used to evaluate model performance. Integrated Gradients was used to interpret which chemical substructures influenced the ER prediction (40).

The model was trained on 80% of the training dataset for 200 epochs with a batch size of 1024. The mean absolute error (MAE) was used as the loss function, and Adam was used as the optimization algorithm (41). The hyperparameters of Adam, including the learning rate, exponential decay rates, epsilon (a parameter for numerical stability), and weight decay (L2 penalty), were optimized using Optuna to maximize the accuracy on the remaining 20% of the training dataset (42). The number of optimization trials was set to 100. The Captum library was used to perform Integrated Gradients (43).
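A condensed sketch of this single-task pipeline with DGL-LifeSci and Optuna is shown below; the featurizer choice only approximates the atom features listed above, full-batch training replaces the mini-batch (size 1024) loop for brevity, the search ranges are assumptions, and validation_mae is a hypothetical helper that scores the 20% hold-out split:

```python
import dgl
import optuna
import torch
import torch.nn.functional as F
from dgllife.model import GCNPredictor
from dgllife.utils import CanonicalAtomFeaturizer, smiles_to_bigraph

# train_smiles / train_log_er: SMILES strings and log ER labels (assumed prepared upstream)
featurizer = CanonicalAtomFeaturizer()                        # approximate atom features
graphs = [smiles_to_bigraph(s, node_featurizer=featurizer) for s in train_smiles]
labels = torch.tensor(train_log_er, dtype=torch.float32).unsqueeze(-1)
bg = dgl.batch(graphs)

def objective(trial: optuna.Trial) -> float:
    model = GCNPredictor(in_feats=featurizer.feat_size(), n_tasks=1)  # default: 2 GCN layers
    optimizer = torch.optim.Adam(
        model.parameters(),
        lr=trial.suggest_float("lr", 1e-4, 1e-2, log=True),
        betas=(trial.suggest_float("beta1", 0.8, 0.99),
               trial.suggest_float("beta2", 0.9, 0.9999)),
        eps=trial.suggest_float("eps", 1e-9, 1e-7, log=True),
        weight_decay=trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True),
    )
    for _ in range(200):                                      # 200 epochs
        optimizer.zero_grad()
        loss = F.l1_loss(model(bg, bg.ndata["h"]), labels)    # MAE loss
        loss.backward()
        optimizer.step()
    return validation_mae(model)                              # hypothetical hold-out scorer

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
```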

Multi-task learning was applied to build a single model to predict MDR1 and BCRP activities for the same compound. The model architecture was the same as that of the single-task GCN, and the output layer size was changed to two. The loss function, optimization algorithm, and method of hyperparameter tuning were the same as in the single-task GCN.
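A minimal sketch of this multi-task variant, assuming a masked MAE loss so that compounds measured in only one of the two assays still contribute (the masking scheme itself is an assumption):

```python
import torch
from dgllife.model import GCNPredictor

model = GCNPredictor(in_feats=featurizer.feat_size(), n_tasks=2)  # outputs: [MDR1 log ER, BCRP log ER]

def masked_mae(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """mask is 1 where a label was measured and 0 where it is missing."""
    return (torch.abs(pred - target) * mask).sum() / mask.sum()
```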

Multimodal learning utilized both the chemical structures of the drug molecules and the amino acid sequences of the two transporters. Chemical structures were encoded into 128-dimensional feature vectors using two graph convolutional layers, and the amino acid sequences of MDR1 and BCRP were encoded into 32-dimensional feature vectors using one one-dimensional convolutional layer and one linear transformation layer. The two feature vectors were concatenated, and two linear transformation layers predicted the ER from the concatenated vector. The training and hyperparameter tuning methods were the same as those for the single-task GCN.
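A hedged sketch of this multimodal architecture is shown below; the layer sizes follow the text, but the one-hot sequence encoding, kernel size, mean readout, and hidden size of the prediction head are assumptions:

```python
import dgl
import torch
import torch.nn as nn
from dgllife.model.gnn import GCN

class MultimodalGCN(nn.Module):
    def __init__(self, atom_feat_size: int, seq_channels: int = 20, max_seq_len: int = 1500):
        super().__init__()
        self.graph_encoder = GCN(in_feats=atom_feat_size, hidden_feats=[128, 128])  # 2 GCN layers
        self.seq_conv = nn.Conv1d(seq_channels, 8, kernel_size=8)     # 1D conv over the sequence
        self.seq_fc = nn.Linear(8 * (max_seq_len - 7), 32)            # 32-d transporter vector
        self.head = nn.Sequential(nn.Linear(128 + 32, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, bg, atom_feats, seq_onehot):
        node_h = self.graph_encoder(bg, atom_feats)                   # per-atom embeddings
        with bg.local_scope():
            bg.ndata["h"] = node_h
            mol_vec = dgl.readout_nodes(bg, "h", op="mean")           # 128-d molecule vector
        seq_h = torch.relu(self.seq_conv(seq_onehot)).flatten(1)
        seq_vec = torch.relu(self.seq_fc(seq_h))
        return self.head(torch.cat([mol_vec, seq_vec], dim=1))        # predicted log ER
```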

For the graph-based approaches, PyTorch (44), Deep Graph Library (45), and DGL-LifeSci (46) were used. We used the GCNPredictor implementation in DGL-LifeSci with default parameters as the GCN architecture. Molecules were drawn using RDKit, an open-source cheminformatics toolkit (47).
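As a rough illustration of the attribution step, Integrated Gradients can be applied to the atom features via Captum; the graph-replication wrapper below is one common pattern and not necessarily the authors’ exact implementation:

```python
import dgl
import torch
from captum.attr import IntegratedGradients

def attribute_atoms(model, graph) -> torch.Tensor:
    """Per-atom contribution to the predicted log ER of a single molecule."""
    model.eval()
    feats = graph.ndata["h"]

    def forward_fn(node_feats):                    # node_feats: (steps, num_atoms, feat_size)
        steps = node_feats.shape[0]
        bg = dgl.batch([graph] * steps)            # replicate the graph to match the batch
        return model(bg, node_feats.reshape(-1, node_feats.shape[-1]))

    ig = IntegratedGradients(forward_fn)
    attributions = ig.attribute(feats.unsqueeze(0),
                                baselines=torch.zeros_like(feats).unsqueeze(0),
                                n_steps=50)
    return attributions.squeeze(0).sum(dim=1)      # sum over feature dimensions per atom
```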

Results

Dataset Analysis

The ER histogram did not follow a normal distribution; therefore, the ER was log-transformed to reduce the skewness of the measured variable (Fig. 1). The ER distribution was comparable between the time-split training and test sets. The distribution of the BCRP ER in the external dataset differed from that in the in-house dataset (Fig. S1).

Fig. 1

Distribution of ER and log(ER) values for MDR1 and BCRP in the training set (solid columns) and test set (open columns). Values represent the relative frequency of the ER (a) and log(ER) (b) for MDR1 and the ER (c) and log(ER) (d) for BCRP. The relative frequency was calculated by dividing each frequency count by the sum of all frequencies

Descriptor-Based Approach

The predictive performances of the 25 machine learning models were compared using tenfold cross-validation on each training set, and model performance was evaluated using the percentage within a twofold error, R2, and RMSE. For MDR1, CatBoost showed the best performance on the time-split training set (Table I). The ER predictivity of the CatBoost model was then evaluated using the test set, which was independent of the training set. The R2 and RMSE values were 0.673 and 0.424, respectively, and 59.6% of the predicted ER values were within twofold of the observed values (Table II). Because tree-based models enable the calculation of feature importance, the top 10 important features of the MDR1 ER prediction model were also investigated (Table S1).

Table I Predictive Performance of Models Using alvaDesc Descriptors Based on 10-Fold Cross-Validation
Table II Predictive Performance of Machine Learning Models and Graph Convolutional Networks

Support vector regression produced the best results for BCRP ER prediction in the time-split training set (Table I). The R2 and RMSE values were 0.536 and 0.333, respectively, in an independent time-split test set, and 63.9% of the predicted ER values were within twofold of the observed values (Table II).

Compared with the in-house time-split test set, the predictive performance on the external dataset was poor. In the external MDR1 and BCRP datasets, the percentages within twofold error were 41.3% and 36.2%, respectively (Fig. S3).

Graph-Based Approach

The predictive performance of the GCN is shown in Table II. Model performance was evaluated using the test set, which was independent of the training set. The R2 values of the MDR1 and BCRP ER predictions for the test set were 0.651 and 0.484, respectively. In the external MDR1 and BCRP datasets, the predictive performances were poorer than those in the in-house time-split test set, and the percentages within twofold error were 47.8% and 55.3%, respectively (Fig. S2). The chemical substructures contributing to the predicted ER in the external dataset were visualized using Integrated Gradients (Fig. S4).

The results of multi-task and multimodal learning are shown in Table II. Multi-task learning did not improve ER prediction accuracy. Meanwhile, based on R2 and RMSE, multimodal learning improved the predictive performance of BCRP ER.

Ensemble Approach

An ensemble model was developed to exploit the advantages of both the descriptor-based and graph-based approaches; the final predicted ER was the average of the ERs predicted by the descriptor-based machine learning model and the GCN. For both transporters, the accuracy was higher than that of either the descriptor-based or the graph-based approach alone (Table II). The R2 values of the MDR1 and BCRP ER predictions in the time-split test set were 0.706 and 0.587, respectively. The predictive performance on the external dataset was lower than that on the in-house time-split test set: 50.0% and 55.3% of the predicted MDR1 and BCRP ER values, respectively, were within twofold of the observed values (Figs. 2 and 3).
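The averaging step itself is straightforward to reproduce; a minimal sketch is shown below (the averaging is shown on the log(ER) scale used for model training, which is an assumption about the scale of averaging, and the variable names are illustrative):

```python
# Average the descriptor-based and GCN predictions for the final ensemble estimate
ensemble_log_er = (descriptor_log_er_pred + gcn_log_er_pred) / 2.0
ensemble_er = 10 ** ensemble_log_er   # back-transform, assuming a log10 transform
```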

Fig. 2

Observed versus predicted ER in MDR1 (a) and BCRP (b) for the ensemble approach in the time-split test set. The solid line represents the line of unity. The dashed lines represent a twofold deviation

Fig. 3

Observed versus predicted ER in MDR1 (a) and BCRP (b) for the ensemble approach in the external set. The solid line represents the line of unity. The dashed lines represent a twofold deviation

Discussion

Physiologically based pharmacokinetic (PBPK) models have been used to describe in vivo intestinal absorption and brain permeability and can integrate quantitative ER values into each process. In particular, we previously reported that unbound brain-to-plasma partitioning (Kp,uu,brain) could be predicted using the ER in MDR1 and BCRP. Accurately predicted ER values can therefore be incorporated into the PBPK model to predict Kp,uu,brain.

The automated machine learning framework using PyCaret minimizes technical hurdles. In the descriptor-based prediction of the MDR1 ER, CatBoost, an open-source gradient boosting library developed by Yandex (48), showed the highest predictivity in the time-split test set. Researchers from various fields have successfully used CatBoost for machine learning on large datasets since its release in 2017 (7). However, to our knowledge, there are no reports on the prediction of ADME-Tox parameters using CatBoost. In contrast, for descriptor-based prediction of the BCRP ER, support vector regression, a popular machine learning algorithm widely used for the classification and regression of ADME properties (49, 50), showed the highest predictivity in the time-split test set. A comprehensive automated machine learning framework can thus help select the best among traditional and contemporary algorithms.

Previous research has demonstrated that graph-based approaches are superior to descriptor-based prediction for multiple ADME parameters (34). However, in our study, the MDR1 and BCRP ER predictive performance of the GCN was poorer than that of the descriptor-based approaches. We used a simple GCN architecture comprising two graph convolutional layers and two linear transformation layers, whereas more sophisticated models, such as graph attention networks (GATs) (51) and message passing neural networks (MPNNs) (52), can enhance predictive performance. Additionally, many features are available for each atom in a molecule that capture both the electronic properties of the atom and the bonds in which it participates. Modifying the atomic features and introducing bond features may improve the performance of graph-based approaches.

Previous research has demonstrated that a multi-task GCN is superior to a single-task GCN for predicting intrinsic clearance and solubility (31). However, in this study, the multi-task GCN performed more poorly than the single-task GCN, presumably because of the imbalance in dataset size between the two tasks, which produces many missing values, particularly for the smaller dataset (i.e., the BCRP dataset in this study). The possibility that many missing values lower the performance of multi-task learning has been reported previously (53). Datasets with few missing values may therefore be required to improve the performance of multi-task learning.

In contrast, multimodal learning may help overcome the challenge of imbalanced ADME datasets. Although multimodal learning did not improve predictive performance for the MDR1 ER, it improved the predictivity of the BCRP ER (Table II and Fig. S2). This may be because multimodal learning mitigates the limited-data problem by combining information from two modalities; in our case, the BCRP dataset was smaller than the MDR1 dataset, and multimodal learning may have compensated for this limitation. Meanwhile, the multimodal learning approach in this study is constrained by the number of protein sequences used. Because only the sequences of two transporters were included, the model was effectively informed only that the transporters differ; its ability to recognize similarities and differences among the sequences of diverse transporters has not been tested. Thus, further studies using efflux activity datasets that include more diverse transporters are required to investigate the utility of multimodal learning.

In both the descriptor-based machine learning and molecular graph-based approaches, the predictive performance on the external dataset was lower than that on the in-house time-split test set, indicating that the coverage of chemical space differed between the two datasets. Indeed, descriptor-based principal component analysis (PCA) plots demonstrated that some compounds in the external dataset fell outside the distribution of the in-house training set (Fig. 4). The applicability domain must therefore be considered when estimating the model’s reliability and coverage.
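A hedged sketch of this chemical-space check with scikit-learn, assuming the alvaDesc 2D descriptor matrices for the three datasets have already been computed (the variable names are illustrative):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(train_descriptors)          # fit on the in-house training set only
pca = PCA(n_components=2).fit(scaler.transform(train_descriptors))

train_pc = pca.transform(scaler.transform(train_descriptors))
test_pc = pca.transform(scaler.transform(test_descriptors))
external_pc = pca.transform(scaler.transform(external_descriptors))   # compare PC1/PC2 coverage
```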

Fig. 4

Principal component analysis of the in-house training set, test set, and external dataset. Principal component 1 (PC1) versus principal component 2 (PC2) for the MDR1 (a) and BCRP (b) data. The in-house training data are indicated in purple, the in-house test data in green, and the external data in yellow

Furthermore, we applied Integrated Gradients to the GCN models to interpret which compound substructures contributed to the predicted ER. Several studies have reported substructures that affect the MDR1 and BCRP ERs using fingerprint-based machine learning approaches (54,55,56,57). Herein, using the GCN, the primary amine moiety was frequently identified as increasing the ER, whereas hydroxyl groups (especially secondary alcohols) and halogens (Cl and F) were occasionally identified as decreasing the ER (Fig. S4). While hydroxyl groups and halogens are common in MDR1 non-substrates (55, 57), the role of primary amines in efflux activity has not been reported in the context of machine learning. QSAR studies have reported that substructures of inhibitors, such as amino groups, fluorine, and chlorine, can be recognized by ABC transporters (58), although inhibitors may not act on the transporters in the same way as substrates. Additionally, some contributing substructures differ between the fingerprint-based approaches of previous studies and the graph-based approach of this study; for example, nitrile and thial groups frequently occur in non-substrates (54). These findings suggest that combining multiple approaches, including fingerprint-, descriptor-, and graph-based approaches, would strengthen knowledge of the chemical structures contributing to the ER. The GCN enables structure-based interpretation, thereby improving the efficiency of structural optimization in drug discovery.

In this study, the ensemble model achieved higher predictive performance than the descriptor- and graph-based approaches alone (Table II). Adding molecular descriptors to a graph convolutional model has been reported to improve predictive performance for molecular property prediction (59). The molecular graph-based approach learns the relationship between chemical structures and efflux activity, whereas the descriptor-based approach uses many kinds of molecular descriptors, such as constitutional, topological, and pharmacophore descriptors, in addition to structural features such as functional groups and fragment counts. Among the top 10 features extracted from the MDR1 ER prediction model, descriptors directly expressing chemical structures were limited, e.g., the frequency of C–N at topological distance 1 and the number of double bonds (Table S1). The ensemble approach can consider both types of features, leading to high predictive performance.

Conclusions

This study successfully developed prediction models for the ER in MDR1- and BCRP-expressing cells using molecular descriptor-based machine learning and graph-based approaches. Multimodal learning outperformed the other GCN approaches in predicting the BCRP ER. Because the GCN enables visualization of atomic contributions to the prediction, this information would be useful for identifying the substructures that increase the ER and for improving the ER during drug optimization. Finally, an ensemble approach combining descriptor-based machine learning and the GCN improved the prediction of both the MDR1 and BCRP ERs, enabling early decision-making in compound prioritization.