Introduction

Epidermal growth factor receptor (EGFR) is a transmembrane protein in the receptor tyrosine kinase family [1]. When the extracellular domain of EGFR binds its ligands (such as EGF or TGF-α), the receptor dimerizes and is activated, which triggers the intracellular tyrosine kinase activity of EGFR [2]. This activates a series of downstream signalling pathways, such as the MAPK, PI3K/Akt, and JAK/STAT pathways, which regulate cell growth, proliferation, migration, and survival [3].

Aberrant activation of the EGFR signalling pathway is associated with the occurrence and progression of various cancers and with their responsiveness to treatment [4]. For example, genetic mutations or amplifications of EGFR are common in non-small cell lung cancer and are associated with sensitivity to EGFR tyrosine kinase inhibitors [3]. In addition, other factors, such as miR-20b [5], HOXA3 [6], and JMJD8 [7], have been found to play roles in cancer that are linked to activation of the EGFR signalling pathway.

To inhibit excessive activation of the EGFR signalling pathway, four generations of EGFR inhibitors have been developed (Fig. 1). First-generation EGFR inhibitors (such as erlotinib and gefitinib) primarily target activated EGFR mutants; however, the T790M mutant is resistant to these inhibitors [8], and they can cause toxic side effects by also targeting wild-type EGFR [9]. Second-generation EGFR inhibitors (such as afatinib and dacomitinib) are designed to target the T790M mutation. However, their inhibition of wild-type EGFR may lead to more side effects (such as rashes and gastrointestinal disorders) [10, 11], which has limited their further clinical application [12]. Third-generation EGFR inhibitors (such as osimertinib) specifically target the T790M mutation and have weaker inhibitory effects on wild-type EGFR, and therefore cause fewer side effects [13,14,15]. Fourth-generation EGFR inhibitors, currently in clinical stages, primarily target triple mutants harbouring C797S (such as Del19/T790M/C797S and L858R/T790M/C797S) [16]. The discovery of novel fourth-generation EGFR inhibitors therefore has significant research implications for overcoming the resistance of C797S-mediated triple mutants [17].

Fig. 1 Features of fourth-generation EGFR inhibitors

In recent years, machine learning techniques have been widely applied in fields such as biology, medicine, and computer science. Machine learning has also been used extensively in drug discovery; for example, machine learning models have been used to overcome drug resistance [18] and to discover broad-spectrum active ruthenium(II) N-heterocyclic carbene compounds [19], diarylurea antibiotics [20] and anticancer drugs [21].

This study utilized ROC-guided machine learning, virtual screening, and bioactivity evaluations to discover potential compounds targeting EGFRT790M/C797S and EGFRT790M/C797S/L858R. The DUD-E website was used to construct the training datasets, five machine learning models were evaluated, and the best model was selected for predicting potential compounds. Receptor selection, virtual screening, cluster analysis, and binding analysis were then performed, and six promising compounds were identified. After kinase assays, antiproliferative activity tests, AO (acridine orange) staining, cell cycle experiments, qRT–PCR, and haemolytic activity evaluations, Compound T001-10027877 was identified as a novel inhibitor of EGFRT790M/C797S and EGFRT790M/C797S/L858R. The entire framework of the study is illustrated in Fig. 2.

Fig. 2 The machine learning, virtual screening and bioactivity evaluation framework

Methods and materials

Data preparation

Dataset preparation for machine learning

First, 100 known active compounds (Table S1) were obtained from Selleck (https://www.selleck.cn/) and used as a reference to ensure the quality of the datasets used for machine learning. Then, the SMILES files of the active compounds were submitted to the DUD-E website [22] (https://dude.docking.org/), which processed these files and returned a pool of decoy molecules. From this pool, we systematically selected one molecule out of every seven, forming a set of 220 decoys. The active dataset (100 compounds) and the decoy dataset (220 compounds) were combined into a single dataset of 320 compounds (Table S2). This consolidated dataset was subsequently converted into an SDF file for the analysis of its physicochemical properties.
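The subsampling and labelling steps described above can be sketched in a few lines of Python. This is only an illustration: the identifiers are placeholders rather than real SMILES strings, the pool size of 1540 is an assumption chosen so that keeping every seventh molecule yields 220 decoys, and the function names are hypothetical.

```python
def take_every_nth(pool, n=7):
    """Keep one item out of every n, preserving the original order."""
    return pool[::n]


def build_labelled_dataset(actives, decoys):
    """Merge actives (class 1) and decoys (class 0) into one labelled list."""
    return [(mol, 1) for mol in actives] + [(mol, 0) for mol in decoys]


# Placeholder identifiers; real entries would be SMILES strings.
actives = [f"active_{i}" for i in range(100)]
# Assumed pool size: the source does not report how many decoys DUD-E returned.
decoy_pool = [f"decoy_{i}" for i in range(1540)]

decoys = take_every_nth(decoy_pool, n=7)
dataset = build_labelled_dataset(actives, decoys)
print(len(decoys), len(dataset))
```

The label in the second tuple position corresponds to the "class" value used later for classification (1 for actives, 0 for decoys).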

Next, the physicochemical properties of the active and inactive molecules were predicted using KNIME [23] (https://www.KNIME.com/). This process (Figure S1) used modules such as SDF Reader, RDKit From Molecule, RDKit Descriptor Calculation, and CSV Writer to predict the physicochemical properties of the 320 compounds. Notably, before the predictions were made, a “class” column was added for classification, in which the active molecules were given the label 1 and the inactive molecules were given the label 0. This labelling step is required for the supervised training and evaluation of the models.

Compound library preparation

First, three compound libraries (the FDA, Bioactivity and Specs libraries; over 220000 compounds) were uploaded into the cheminformatics software KNIME to remove duplicate molecules and then to predict the physicochemical properties of those remaining (a similar method was described in section “Dataset preparation for machine learning”). These physicochemical properties were then used as features for machine learning to predict potential compounds.

Machine learning and the prediction of potential active compounds

Selecting the best-performing machine learning model

Using the data mining and analysis software Orange3 [24], we evaluated five machine learning models (KNN, SVM, random forest, neural network and gradient boosting) on the training datasets described in section “Dataset preparation for machine learning”. The machine learning parameters were set as follows: (1) 70% of the dataset was used for training, and 30% was used for testing; (2) 10-fold cross-validation was used; and (3) the other parameters were kept at their default values. We considered AUC, CA (classification accuracy), F1, precision, and recall to select the best-performing machine learning model for predicting active compounds.
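To make the five evaluation scores concrete, the following is a minimal pure-Python sketch of how AUC, CA, precision, recall and F1 can be computed from true labels and predicted probabilities. It is not the Orange3 implementation: the function names and the toy data are hypothetical, the scores are reported for the active class only (Orange3 may average over classes), and a 0.5 decision threshold is assumed.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random active outranks a random decoy.

    Ties between an active and a decoy count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_metrics(y_true, scores, threshold=0.5):
    """Return CA, precision, recall and F1 for the active class."""
    y_pred = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "CA": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "F1": f1,
    }


# Toy data (hypothetical): three actives and five decoys.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2, 0.1, 0.05]

auc = roc_auc(y_true, scores)
metrics = classification_metrics(y_true, scores)
print(round(auc, 3), metrics)
```

AUC is threshold-independent, whereas the other four scores depend on the chosen cut-off, which is one reason to consider them together when ranking models.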

Predicting potential active compounds with the random forest model

The top-performing machine learning model, random forest, was used to predict the per-compound “class” values for the compounds in the FDA library and Specs library. These unitless “class” values range from 0 to 1 and represent the predicted likelihood that a compound belongs to the active class. Compounds whose “class” values exceeded 0.5, and ideally were close to 1, were considered more likely to be active and were carried forward to the virtual screening stage.
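The selection rule can be illustrated with a short sketch. The compound identifiers and probabilities below are invented for illustration, and the function name is hypothetical; only the cut-off of 0.5 comes from the text.

```python
def select_predicted_actives(predictions, threshold=0.5):
    """Return (compound_id, probability) pairs above the threshold, highest first."""
    hits = [(cid, p) for cid, p in predictions.items() if p > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical "class" values returned by a trained classifier.
predictions = {"cmpd_A": 0.92, "cmpd_B": 0.48, "cmpd_C": 0.50, "cmpd_D": 0.71}

hits = select_predicted_actives(predictions)
print(hits)
```

Because the comparison is strict, a compound with a value of exactly 0.5 is not retained in this sketch.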

Virtual screening for hit compounds

Docking software and protein selection

Recent research [25] has shown that AutoDock Vina performs comparably to other widely used docking programs, including LeDock [26], GOLD [27], MOE [28], and Glide [29]. Furthermore, Vina has proven to be a valuable tool in numerous docking studies aimed at identifying new active compounds. Vina was therefore chosen as the docking software for this virtual screening.

Our primary aim was to identify compounds capable of targeting EGFR mutants, specifically those with dual or triple mutations. This objective requires careful selection of the protein structure used for virtual screening. Therefore, three proteins, 5ZWJ (EGFRT790M/C797S/V948R mutant), 6JRJ (EGFRT790M/C797S mutant) and 6LUD [30,31,32] (EGFRL858R/T790M/C797S mutant), were evaluated as candidate receptors. For each of these three targets, the original co-crystallized ligand was extracted and redocked into the active site. The RMSD and the aligned model were used to evaluate the sensitivity of Vina towards these three proteins. Finally, the complex with the best RMSD and binding affinity was selected as the docking receptor for the virtual screen. The docking parameters are listed in Table 1.
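The redocking criterion relies on the root-mean-square deviation between the crystallographic ligand pose and the redocked pose. A minimal sketch of that calculation is shown below. It assumes that both poses list the same atoms in the same order and share the receptor's coordinate frame, so no superposition or symmetry correction is applied; the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Hypothetical atom coordinates (x, y, z) for a crystal pose and a redocked pose.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
redocked_pose = [(0.1, 0.0, 0.0), (1.6, 0.0, 0.0), (1.6, 1.5, 0.0)]

value = rmsd(crystal_pose, redocked_pose)
print(round(value, 3))
```

A lower value means the docking program reproduces the experimental binding pose more closely, which is why the RMSD is considered alongside the predicted affinity when choosing the receptor.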

Table 1 Docking parameters of the proteins 5ZWJ, 6JRJ and 6LUD

Compound normalization and preparation

The hit compounds identified through machine learning were normalized using KNIME, following the approach outlined in section “Compound library preparation”. This workflow (Figure S2) includes several modules: File Reader, RDKit Canon SMILES, Duplicate Row Filter, RDKit Add Hydrogens, RDKit Generate Coordinates, RDKit Optimize Geometry, and SDF Writer. These steps canonicalize the SMILES strings, remove duplicates, add hydrogens, generate molecular coordinates, optimize geometries, and save the data in SDF format. Finally, the compound library was converted into PDBQT files for docking.

Virtual screening

First, the protein was processed in Discovery Studio 3.0 to remove water molecules, ions, and the original ligand and to correct the residues. Subsequently, the PDBQT files of both the protein and the molecules were used as input, with the parameters specified in the configuration file described in section “Docking software and protein selection”. Next, the compounds with the highest affinity were retained for cluster analysis using Discovery Studio 3.0. Finally, the potential molecules were chosen based on analysis of both the cluster library and the binding model. Further methodological details of the virtual screening can be found in our previous work [33,34,35].

EGFR kinase assay

The kinase inhibition assays were conducted by Heyan Biopharmaceutical Technology Co., Ltd. (Wuhan). The compounds were tested for their inhibitory activity against EGFRT790M/C797S/L858R and EGFRT790M/L858R using the ADP-Glo and LANCE Ultra methods. Further methodological details can be found in our previous work [36].

Antitumour evaluation

Four cell lines (A549, H1975, H460 and Ba/F3-EGFRL858R/T790M/C797S) in the logarithmic phase of growth were seeded into 96-well plates and incubated for 24 h. All four cell lines were obtained from Heyan Biopharmaceutical Technology Co., Ltd. (Wuhan) and used as per the guidelines and protocols approved for research. Subsequently, 20 µL of compound-containing medium was added to each well, with five concentrations of each compound and three replicates for each concentration. The final concentrations of the six compounds, D008-10022050, D008-10206173, D008-10099272, D008-10135583, T001-10026427, and T001-10027877, were 100.0, 33.3, 11.1, 3.7, and 1.2 µM. After 72 h of incubation, MTT solution (5 mg/mL) was added to each well, and the plates were incubated for a further 4 h. The MTT solution was then removed, and DMSO (150 µL) was added to each well. Finally, a microplate reader was used to measure the optical density (OD) of each well at 492 nm.
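The five final concentrations are consistent with a three-fold serial dilution starting at 100 µM. The sketch below reproduces the series under that assumption; the source does not state how the dilutions were prepared, and the function name is hypothetical.

```python
def serial_dilution(top_concentration, factor, steps):
    """Return `steps` concentrations, each `factor` times lower than the last."""
    return [round(top_concentration / factor**i, 1) for i in range(steps)]


# Assumed scheme: three-fold dilution from 100 µM, five concentrations.
concentrations_um = serial_dilution(100.0, 3, 5)
print(concentrations_um)  # [100.0, 33.3, 11.1, 3.7, 1.2]
```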

The antiproliferative effects of the compounds on tumour cells were determined from the measured OD values. The inhibition rate (%) was calculated using the following formula: (OD control − OD treatment)/(OD control − OD blank) × 100%. The half-maximal inhibitory concentration (IC50) of each compound in each cell line was then calculated using SPSS software.
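The inhibition-rate formula and a simple IC50 estimate can be sketched as follows. The inhibition formula is the one given above. The IC50 step is a stand-in: the source states only that SPSS was used, so the log-linear interpolation here is an assumed, simplified substitute for whatever regression was actually applied, and all OD values are hypothetical.

```python
import math


def inhibition_rate(od_control, od_treatment, od_blank):
    """Inhibition (%) = (OD control - OD treatment) / (OD control - OD blank) x 100."""
    return (od_control - od_treatment) / (od_control - od_blank) * 100.0


def ic50_log_interpolation(concentrations, inhibitions):
    """Estimate IC50 by linear interpolation on log10(concentration).

    Returns None when no pair of adjacent points brackets 50% inhibition.
    """
    points = sorted(zip(concentrations, inhibitions))
    for (c_lo, i_lo), (c_hi, i_hi) in zip(points, points[1:]):
        if i_lo <= 50.0 <= i_hi and i_hi != i_lo:
            fraction = (50.0 - i_lo) / (i_hi - i_lo)
            log_ic50 = math.log10(c_lo) + fraction * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_ic50
    return None


# Hypothetical OD readings at the five tested concentrations (µM).
od_control, od_blank = 1.0, 0.1
concentrations_um = [1.2, 3.7, 11.1, 33.3, 100.0]
od_treated = [0.91, 0.73, 0.46, 0.28, 0.145]

inhibitions = [inhibition_rate(od_control, od, od_blank) for od in od_treated]
ic50 = ic50_log_interpolation(concentrations_um, inhibitions)
print([round(i, 1) for i in inhibitions], round(ic50, 2))
```

Interpolating on a logarithmic concentration axis reflects the fact that dose-response curves are approximately sigmoidal in log(dose); a fitted model would normally be preferred when enough data points are available.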

AO staining analysis

Acridine orange (AO) is a dye that interacts with DNA and RNA by intercalation or electrostatic attraction [37]. In living cells, AO primarily stains the nucleus green; in dead or damaged cells, AO can penetrate the cell and bind to DNA, staining the nuclei of dead cells orange or red and thus differentiating between living and dead cells. For this assay, T001-10027877 was administered at concentrations of 0.5, 1, and 2 µM, with 1 µM AZD9291 serving as the positive control and untreated cells serving as the blank control. The entire experimental method has been described in our previous research [38].

Cell cycle

H1975 cells were seeded at a density of 4.0 × 10^5 cells per flask and cultured at 37 °C in a 5% CO2 incubator for 24 h. The drugs were dissolved in 20 µL of DMSO and then diluted in culture medium to the desired concentrations (AZD9291: 1 µM; T001-10027877: 0.5 µM or 1 µM) before being added to the culture flasks. After incubation, the cells were digested with trypsin, centrifuged at 1000 rpm for 5 min to remove the supernatant, and washed twice with PBS. Next, 10 mL of 70% ethanol was added at 4 °C to fix the cells for more than 24 h. The fixed cells were centrifuged at 2000 rpm for 5 min, the supernatant was discarded, and the cells were washed twice with PBS before 300 µL of PI staining solution was added. After incubation in the dark at 4 °C for 30 min, the cell cycle distribution was analysed using flow cytometry, and key data were retained.

Migration experiment

To conduct the cell scratch assay, log-phase H1975 cells were seeded into a six-well plate, with each well receiving approximately 5 × 10^5 cells. Before seeding, parallel and equidistant lines were marked on the underside of the plate to serve as guides for the scratch. After 24 h, the cells had reached confluence, and a pipette tip was used to create a scratch along the premarked lines. Subsequently, the plate was rinsed twice with PBS to remove any detached cells, and fresh culture medium was added. At this stage, the cells were treated with T001-10027877 at concentrations of 0.5, 1, and 2 µM, and a positive control group was treated with 1 µM AZD9291. The scratched area was observed and photographed under a microscope to establish baseline data (0 h). Culture was then continued in the incubator for 24 h, after which the cells were observed and photographed again to document the migration and healing process, with particular attention given to the effects of T001-10027877 and AZD9291 on cell migration.

Real-time PCR

Using 1 µM AZD9291 as a positive control, fluorescence quantitative PCR was used to investigate the concentration-dependent effect of T001-10027877 on EGFR and mTOR expression in H1975 cells. The concentrations of T001-10027877 used were 0.5 and 1 µM. The procedure was described by Sun et al. [38].

Haemolytic activity assay

A haemolytic activity assay was conducted to evaluate the potential haemolytic effects of T001-10027877. The concentrations tested for T001-10027877 included 0 (as a control), 16, 32, 64, 128, and 256 µg/mL. In this assay, 1% Triton X-100, which is known to cause 100% haemolysis of sheep blood erythrocytes, was used as a positive control to indicate complete cell lysis. Saline solution served as a negative control, representing no haemolytic activity. The specific experimental method was based on our previous research [38].

Results

Machine learning to predict potential compounds

Selection of the machine learning model

To select the best machine learning model, the ROC method was used. First, five machine learning models (KNN, SVM, random forest, neural network and gradient boosting) were built with Orange3 and evaluated on the training set, and the results are shown in Table 2 and Fig. 3. The AUC values can be calculated from Table S3. According to Table 2 and Fig. 3, the AUCs of the random forest (0.98) and SVM (0.98) models were better than those of the other three models (KNN (0.93), neural network (0.97) and gradient boosting (0.97)). Between the random forest and SVM models, the random forest model performed better in terms of CA, F1, precision and recall. Based on these results, the random forest model was selected as the machine learning model in this study.

Fig. 3 ROC curves of five machine learning models: KNN (A), SVM (B), random forest (C), neural network (D) and gradient boosting (E). The AUCs of the five machine learning methods are summarized in (F)

Table 2 Performance of the five machine learning models

Using the random forest model to predict potential compounds

After the physicochemical properties of the compounds in the library were predicted, the data mining software Orange3 was used to construct a random forest model to predict the probability of target inhibition. Compounds with a predicted probability of the active class greater than 0.5 were retained, which resulted in 5105 compounds (Table S4). These compounds were then processed in KNIME to add hydrogen atoms, generate coordinates and optimize the geometry, and were saved as an SDF file. Finally, the SDF file was converted into PDBQT format using Open Babel 3.0 for further study. The workflow is shown in Fig. 4.

Fig. 4 Workflow of the random forest model prediction and format conversion of the compound library

Virtual screening

Selecting suitable proteins for virtual screening

To select a suitable receptor for virtual screening, the three mutant proteins were used as receptors, and AutoDock Vina was used to redock the original ligand of each protein into its active pocket; the RMSD and affinity were then used to evaluate the sensitivity of Vina towards the three proteins. As shown in Table 3 and Fig. 5, the original ligand of 6JRJ showed better affinity (−11.4 kcal/mol) than those of 5ZWJ (−8.7 kcal/mol) and 6LUD (−8.0 kcal/mol). In addition, the RMSD of the redocked ligand of 6JRJ (0.18) was lower than that of 5ZWJ (0.35) and close to that of 6LUD (0.07). Taking both criteria into consideration, protein 6JRJ was more suitable for virtual screening than 5ZWJ and 6LUD.

Table 3 RMSDs after redocking the original ligand relative to that of the original protein

Fig. 5 The original ligands (red) aligned with the redocked ligands

Virtual screening and hit compound selection

After the docking software and receptor had been selected, the 5105 compounds were docked into the active site of protein 6JRJ, and 284 compounds with an affinity of less than −9.5 kcal/mol were retained (Table S5). Cluster analysis was then performed to obtain 45 compounds based on the diversity of their skeletons. Finally, based on a comparison of the binding models with that of the original ligand of 6JRJ, 6 compounds were selected as hit compounds, as displayed in Table 4. In addition, the 6 selected compounds were aligned with the original ligand of 6JRJ, and nearly all showed conformations more similar to the original ligand than did the 39 unselected compounds, as displayed in Fig. 6. These results indicated that the selected compounds may inhibit EGFRT790M/L858R and EGFRT790M/C797S/L858R. All selected compounds were purchased from TargetMol (US).
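The first two stages of this funnel can be sketched in code. The affinity cut-off of −9.5 kcal/mol comes from the text; everything else is an illustrative assumption. In particular, picking the best-scoring member of each cluster is only one possible way to obtain structurally diverse representatives, and the authors' own selection also relied on visual comparison of binding models. The compound identifiers, affinities and cluster labels are invented.

```python
def filter_by_affinity(docking_results, cutoff=-9.5):
    """Keep compounds whose affinity (kcal/mol) is more negative than the cutoff."""
    return {cid: aff for cid, aff in docking_results.items() if aff < cutoff}


def best_per_cluster(affinities, cluster_of):
    """Return, for each cluster label, the compound with the most negative affinity."""
    best = {}
    for cid, aff in affinities.items():
        label = cluster_of[cid]
        if label not in best or aff < affinities[best[label]]:
            best[label] = cid
    return best


# Hypothetical docking scores and cluster assignments.
docking_results = {"c1": -10.2, "c2": -9.5, "c3": -9.8, "c4": -8.1, "c5": -11.0}
cluster_of = {"c1": 0, "c2": 0, "c3": 0, "c4": 1, "c5": 1}

retained = filter_by_affinity(docking_results)
representatives = best_per_cluster(retained, cluster_of)
print(sorted(retained), representatives)
```

In this toy example, a compound scoring exactly −9.5 kcal/mol is excluded because the comparison is strict, matching the "less than" wording of the threshold.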

Table 4 Docking information for the 6 selected compounds

Fig. 6 (A) The original ligand of protein 6JRJ aligned with the 39 unselected compounds. (B) The original ligand of protein 6JRJ aligned with the 6 selected compounds. It is important to highlight that these compounds were chosen from among 45 candidates identified through a clustering process

EGFRT790M/C797S/L858R and EGFRT790M/L858R inhibition assays and binding model analysis

To test this prediction, kinase assays were performed to evaluate whether the hit compounds were on target. Table 5 shows that EGFRT790M/C797S/L858R was more sensitive than EGFRT790M/L858R to nearly all the compounds. For example, three compounds (D008-10099272, D008-10135583 and T001-10027877) inhibited EGFRT790M/C797S/L858R with IC50 values of 3.26, 3.25 and 4.32 µM, respectively, whereas only one compound inhibited EGFRT790M/L858R (T001-10027877: 1.27 µM). Compared with AZD9291, which served as a positive control with IC50 values of 0.64 and 0.012 µM against EGFRT790M/C797S/L858R and EGFRT790M/L858R, respectively, T001-10027877 clearly demonstrated less potent kinase inhibition. However, the distinct structural scaffold of T001-10027877 offers significant potential for future structural modifications. Therefore, T001-10027877 emerged as a promising lead compound for devising strategies to counteract EGFR-mediated resistance, especially by targeting the T790M and C797S mutations.

Table 5 Kinase inhibitory activity of six selected compounds against EGFRT790M/C797S/L858R and EGFRT790M/L858R

In addition, T001-10027877 was selected for binding analysis, using D008-10206173 as the negative control, as shown in Fig. 7. According to recent research, the residues M790, L858, S797 and M793 are important for EGFR inhibitors to overcome resistance mediated by T790M or C797S. Figure 7A shows the robust inhibitory activity of AZD9291 against EGFR resistance, as evidenced by the formation of four hydrogen bonds with the crucial amino acids K745, S797, and M793 and the establishment of five pi-cation interactions with other amino acids. This demonstrates its high efficacy in targeting the EGFR protein. On the other hand, Fig. 7B shows that T001-10027877 exhibits a diverse set of interactions, including hydrogen bonding, hydrophobic, and pi-cation interactions, notably enhancing its binding to the S797 residue via additional interactions with M790, S796, and M793. Despite this broad interaction spectrum, the potency of T001-10027877 did not reach that of AZD9291.

Further analysis of D008-10206173 (presented in Fig. 7C) shows that this compound forms primarily hydrophobic and pi-stacking interactions, which are inherently weaker than hydrogen bonds. This comparison revealed that T001-10027877 demonstrates greater activity than D008-10206173 but remains less potent than AZD9291. The similar interaction patterns of T001-10027877 and AZD9291 with the protein, particularly in overcoming EGFR resistance caused by the T790M or C797S mutation, suggest that T001-10027877, despite its reduced potency compared to that of AZD9291, still holds potential as a candidate for overcoming resistance mediated by specific EGFR mutants. Such integrated analysis emphasizes the nuanced efficacy of these compounds in targeting EGFR mutations, highlighting the ability of T001-10027877 to serve as an alternative for combating EGFR resistance.

Fig. 7 Analysis of the molecular interactions of AZD9291-6JRJ, T001-10027877-6JRJ and D008-10206173-6JRJ

Antiproliferative activity

Furthermore, the antiproliferative activity of the six selected compounds was evaluated by the MTT method in four cell lines (A549, H1975, H460 and Ba/F3-EGFRL858R/T790M/C797S). These cell lines were chosen to cover a range of EGFR mutation statuses: A549 and H460 as wild-type EGFR controls, H1975 representing the EGFRT790M/L858R mutations associated with acquired resistance to first-generation EGFR tyrosine kinase inhibitors, and Ba/F3-EGFRL858R/T790M/C797S as a model of the triple mutation involving T790M, L858R, and C797S. As shown in Table 6, the compounds inhibited the growth of these cell lines to different degrees, with IC50 values ranging from 1 to 50 µM. Moreover, the H1975 cell line was the most sensitive to nearly all the compounds, which implies that these selected compounds could be used as potential compounds for overcoming resistance. Notably, T001-10027877 showed activity comparable to, or up to approximately 3-fold greater than, that of the positive control compound AZD9291 (4.95, 1.55, 5.38 and 4.34 µM) against A549, H1975, H460 and Ba/F3-EGFRL858R/T790M/C797S cells, with IC50 values of 1.7, 1.55, 3.51 and 4.46 µM, respectively. Among the tested compounds, only T001-10027877 and AZD9291 inhibited the Ba/F3-EGFRL858R/T790M/C797S cell line, with IC50 values of 4.46 µM and 4.34 µM, respectively. These data indicate that T001-10027877 inhibits both the triple-mutant Ba/F3-EGFRL858R/T790M/C797S cells and the double-mutant H1975 cells to an extent similar to that of AZD9291. This finding highlights the potential of T001-10027877 to inhibit the proliferation of cancer cells with the T790M and C797S mutations in EGFR.
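The fold comparison with AZD9291 follows directly from the IC50 values quoted in this paragraph. The short sketch below performs the arithmetic; the dictionary keys are shortened labels introduced here for readability.

```python
# IC50 values (µM) as quoted in the text.
ic50_azd9291 = {"A549": 4.95, "H1975": 1.55, "H460": 5.38, "Ba/F3 triple mutant": 4.34}
ic50_t001 = {"A549": 1.7, "H1975": 1.55, "H460": 3.51, "Ba/F3 triple mutant": 4.46}


def fold_potency(reference, test):
    """Ratio of reference IC50 to test IC50; values above 1 favour the test compound."""
    return {cell: reference[cell] / test[cell] for cell in reference}


folds = fold_potency(ic50_azd9291, ic50_t001)
for cell, fold in folds.items():
    print(f"{cell}: {fold:.2f}")
```

The ratios are approximately 2.91 (A549), 1.00 (H1975), 1.53 (H460) and 0.97 (Ba/F3 triple mutant), so the largest advantage over AZD9291 is seen in A549 cells, while the two compounds are essentially equipotent in the H1975 and Ba/F3 lines.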

Table 6 The antiproliferative activities of the six selected compounds in A549, H1975, H460 and Ba/F3-EGFRL858R/T790M/C797S cells

AO staining assays

Our goal was to identify a potential compound to overcome the resistance induced by the EGFR mutations T790M and C797S. Because T001-10027877 demonstrated greater inhibition of the EGFR T790M mutant than of the C797S mutant, the H1975 cell line was chosen for the AO staining assay, in which T001-10027877 was tested at concentrations of 0.5, 1, and 2 µM (Fig. 8). As shown in Fig. 8, the H1975 cells in the control group exhibited a normal morphology, emitting circular and dense green fluorescence. Notably, both the positive control (AZD9291) and T001-10027877 induced significant shrinkage of the cell membrane and the appearance of sharply defined cell edges, indicating the induction of apoptosis. At a concentration of 2 µM, T001-10027877 induced apoptosis to the same extent as AZD9291. Moreover, as the concentration of T001-10027877 increased, its ability to induce H1975 cell apoptosis also increased.

Fig. 8 Cell morphology observed by fluorescence microscopy after treatment with T001-10027877

Effects of T001-10027877 on H1975 cell cycle progression

To explore the impact of Compound T001-10027877 on cell proliferation, cell cycle analysis was conducted in H1975 cells. As shown in Fig. 9, Compound T001-10027877 induced S-phase cell cycle arrest in a dose-dependent manner relative to the control group. As the concentration of Compound T001-10027877 increased, the S phase population increased from 17.01 to 28.84%, while the G2/M phase population decreased from 29.02 to 11.54%; the proportion of cells in G0/G1 phase did not change significantly. At a concentration of 1 µM, Compound T001-10027877 arrested 28.84% of the cells in S phase, which is similar to the effect of the positive control AZD9291 (29.10%).

Fig. 9 Compound T001-10027877 induces H1975 cell cycle arrest in a dose-dependent manner

Inhibitory effects of T001-10027877 on H1975 cell migration

A cell scratch assay was used to investigate the effect of T001-10027877 on the migratory capacity of H1975 cells. The cells were treated with 0.5, 1 or 2 µM T001-10027877 for 24 h and then photographed under a microscope. As shown in Fig. 10, after 24 h of culture, the cells in the control group significantly migrated towards the centre. The cells treated with T001-10027877 migrated less than those in the control group, and this inhibitory effect was more potent at 2 µM T001-10027877 than at 0.5 µM T001-10027877. The inhibitory effect of T001-10027877 was similar to that of AZD9291 at a concentration of 1 µM. These results demonstrated that T001-10027877 had a dose-dependent inhibitory effect on the migratory capacity of H1975 cells and exhibited efficacy similar to that of the positive control AZD9291.

Fig. 10 Dose-dependent inhibition of H1975 cell migration by Compound T001-10027877: A comparative study with AZD9291

Quantitative reverse transcription polymerase chain reaction (qRT–PCR) analysis after treatment with T001-10027877

To determine the mechanism of action of Compound T001-10027877 in tumour cells, fluorescence quantitative PCR was used to examine the EGFR and mTOR expression levels in H1975 cells. In untreated H1975 cells, the expression levels of EGFR and mTOR were at baseline, as shown in Fig. 11. Treatment with 0.5 µM T001-10027877 reduced the relative EGFR expression to 0.51 and the relative mTOR expression to 0.64. Increasing the concentration of T001-10027877 to 1 µM further decreased the relative EGFR and mTOR expression to 0.35 and 0.53, respectively, demonstrating dose-dependent inhibition. At 1 µM, the positive control AZD9291 reduced the relative EGFR expression to 0.28 and that of mTOR to 0.52, slightly surpassing the effect of T001-10027877 on EGFR but comparably affecting mTOR expression. These results highlight the dose-responsive inhibitory effect of T001-10027877 on EGFR and mTOR expression, with a more pronounced reduction in EGFR than in mTOR expression at the higher concentration. Furthermore, at 1 µM, the effect of T001-10027877 on EGFR and mTOR expression was comparable to that of the positive control AZD9291.

Fig. 11 Quantitative RT‒PCR analysis of EGFR and mTOR expression in H1975 cells treated with T001-10027877

Haemolytic activity of T001-10027877 at various concentrations

Haemolytic activity was also evaluated. As shown in Fig. 12, the haemolytic activity values were 0, 1.69, 3.78, 7.13, 7.84 and 8.76% for T001-10027877 at 0, 16, 32, 64, 128 and 256 µg/mL, respectively. All of the haemolytic activity values were less than 9%, which suggests that, relative to the positive control (1% Triton X-100), T001-10027877 was safe in terms of haemolytic activity at the tested concentrations.
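Haemolysis percentages of this kind are commonly obtained by normalizing the sample absorbance between the negative (saline) and positive (Triton X-100) controls. The sketch below shows that normalization as an assumption; the source defers the exact protocol to earlier work [38], and the absorbance readings used here are hypothetical.

```python
def haemolysis_percent(a_sample, a_negative, a_positive):
    """Haemolysis (%) = (A sample - A negative) / (A positive - A negative) x 100."""
    return (a_sample - a_negative) / (a_positive - a_negative) * 100.0


# Hypothetical absorbance readings of the supernatant.
a_negative = 0.05  # saline, taken as 0% haemolysis
a_positive = 1.05  # 1% Triton X-100, taken as 100% haemolysis
a_sample = 0.12    # compound-treated erythrocytes

percent = haemolysis_percent(a_sample, a_negative, a_positive)
print(round(percent, 2))
```

With this normalization, the saline control maps to 0% and the Triton X-100 control to 100%, so the reported values below 9% lie close to the negative control.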

Fig. 12 Evaluation of the haemolytic safety of Compound T001-10027877: A dose‒response study

Discussion

Herein, we emphasize the multifaceted evaluation of T001-10027877, which spans from machine learning model selection to detailed molecular interaction analysis, highlighting the potential of this compound as a novel agent for overcoming EGFR-mediated resistance in cancer therapy.

Initially, the selection of the random forest model via the ROC method, as documented in Table 2 and Fig. 3, provided a solid foundation for computational screening. This model outperformed the others, including SVM, KNN, neural network, and gradient boosting, in predicting compounds with potential inhibitory activity against EGFR mutants. It led to the identification of 5105 candidate compounds, which illustrates the model’s usefulness in narrowing the large chemical space of potential inhibitors.

Redocking studies further refined our search, pinpointing protein 6JRJ as the most suitable receptor for virtual screening due to its affinity for and sensitivity towards the ligands, as shown in Table 3 and Fig. 5. This target selection was crucial for identifying 284 compounds with significant affinity, which were further narrowed down to six hit compounds through cluster analysis and binding model analysis, emphasizing the importance of structural diversity in the selection process.

A comparative analysis of these compounds revealed that T001-10027877 had notable inhibitory activity against the EGFRT790M/L858R and EGFRT790M/C797S/L858R mutants, although it was less potent than the positive control, AZD9291, in the kinase assays. Molecular interaction studies (Fig. 7A and B) showed that T001-10027877 has a comprehensive interaction profile with key EGFR residues, suggesting its ability to inhibit EGFR-driven cell proliferation.

Further experimental validation via kinase assays and antiproliferative activity tests across multiple cell lines (A549, H1975, H460, and Ba/F3-EGFRL858R/T790M/C797S) confirmed the activity of T001-10027877. Notably, T001-10027877 demonstrated up to approximately 3-fold greater antiproliferative activity than AZD9291 in certain cell lines, highlighting its potential as a lead for a more effective therapeutic agent.

The antimigratory effects of T001-10027877, as evidenced by the cell scratch assay, and its ability to induce cell cycle arrest, as demonstrated via cell cycle analysis, further substantiated the efficacy of this novel compound. Together with the fluorescence quantitative PCR findings, which showed that T001-10027877 dose-dependently reduced EGFR and mTOR expression, these results point to a targeted mechanism of action against cancer cell proliferation.

Finally, the evaluation of haemolytic activity confirmed the safety of T001-10027877, indicating that it is a viable candidate for further development. The minimal haemolytic effects of this compound, even at high concentrations, align with the requirements for a therapeutically effective yet safe anticancer agent.

In summary, the comprehensive assessment of T001-10027877, from computational predictions to experimental validation, showcases its promise as a novel therapeutic option for overcoming EGFR-mediated resistance. Its potent efficacy, coupled with its favourable safety profile, makes it a compelling candidate for further investigation and potential clinical application in cancer therapy.

Conclusion

In this study, an in silico fusion method, which included machine learning, virtual screening and clustering, was combined with activity evaluations to identify active EGFR inhibitors targeting EGFRT790M/C797S/L858R. First, ROC-guided machine learning was used to identify 5105 compounds from three compound libraries (over 200000 compounds). Second, virtual screening and binding model analysis were performed to obtain six potential compounds. In addition, the kinase assay showed that EGFRT790M/C797S/L858R exhibited greater sensitivity to these six compounds than EGFRT790M/L858R. Among them, Compound T001-10027877 inhibited both EGFRT790M/C797S/L858R and EGFRT790M/L858R, with IC50 values of 4.34 and 1.27 µM, respectively, which suggests that T001-10027877 could overcome the EGFR resistance mediated by the T790M or C797S mutations. Moreover, the antiproliferative effects of the six hit compounds were also investigated, and T001-10027877 showed the best anticancer activity in the A549, H1975, H460 and Ba/F3-EGFRL858R/T790M/C797S cell lines, with IC50 values of 1.7, 1.55, 3.51 and 4.46 µM, respectively. Furthermore, AO staining and cell cycle experiments revealed that Compound T001-10027877 could induce H1975 cell apoptosis in a concentration-dependent manner. Finally, the haemolytic activity evaluation suggested that Compound T001-10027877 is safe for further in vivo study.