Background

Cancer is a term for the abnormal, uncontrolled growth of cells and is one of the most dangerous health problems for humans worldwide [1]. Despite the availability of improved anticancer drugs, the worldwide cancer burden remains high: an estimated 19.3 million new cancer cases and nearly 10 million cancer deaths occurred in 2020 [2].

Breast cancer is the most common cancer among women worldwide, and mortality from breast cancer is commonly due to tumour metastasis [3]. It constitutes a major public health issue globally, with over 1 million new cases diagnosed annually, over 400,000 annual deaths and about 4.4 million women living with the disease [4]. The mortality rate of breast cancer among Nigerian women is about 16% [5].

Amino-thiophene derivatives are known to be one of the most important groups of heterocyclic compounds, with a wide spectrum of biological activities such as antitumor [6], anti-mitotic [7] and antiviral [8] activity. Furthermore, thieno[2,3-d]pyrimidine derivatives show anti-proliferative activity [9], while pyrazole derivatives have favourable antitumor activity [10]. The coumarin scaffold has become an attractive subject because of its broad spectrum of pharmacological activities; its derivatives have been extensively explored for anticancer activity because they show minimal side effects along with multi-drug reversal activity [11]. Many pyridine derivatives have been synthesized as potentially biologically active compounds and have a multitude of pharmacological characteristics, in particular anti-cancer activity [12,13,14].

Quantitative Structure Activity Relationship (QSAR) modelling is one of the most commonly used computational methods for predicting the activities and properties of molecules in drug design, as it saves time and cost [15]. Generating a good QSAR model depends on factors such as the quality of the biological data, the choice of descriptors, variable selection, statistical methods and validation.

The aim of this research is to develop a good QSAR model for predicting the activity of some selected compounds against breast cancer and to design new compounds with better activities against breast cancer.

Methods

Data collection

The dataset used in this work was collected from the literature [16], with activities reported as fifty percent growth inhibition (GI50) concentrations in mmol L−1. The reported inhibitory activities were converted to a logarithmic scale using Eq. (1), which also converts the concentrations from mmol L−1 to mol L−1, to give a well-defined range.

$${\text{pGI}}_{50} = - \log_{10} ({\text{GI}}_{50} \times 10^{ - 3} )$$
(1)
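The conversion in Eq. (1) can be illustrated with a minimal sketch. The function name and the input values are hypothetical and are not taken from the dataset.

```python
import math


def gi50_mmol_to_pgi50(gi50_mmol_per_l: float) -> float:
    """Convert a GI50 given in mmol/L to pGI50 following Eq. (1).

    The factor 1e-3 converts mmol/L to mol/L before taking -log10.
    """
    if gi50_mmol_per_l <= 0:
        raise ValueError("GI50 must be a positive concentration")
    return -math.log10(gi50_mmol_per_l * 1e-3)


if __name__ == "__main__":
    # Illustrative concentrations only (not values from the dataset).
    for gi50 in (1.0, 0.1):
        print(gi50, round(gi50_mmol_to_pgi50(gi50), 4))
```

A lower GI50 gives a higher pGI50, so more active compounds have larger values on the logarithmic scale.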

Compounds sketching, optimization and descriptors calculations

The two-dimensional (2D) structures of the compounds were sketched using ChemDraw software version 12.0.2 [17] and imported into Spartan 14 V.1.1.4 software to obtain the optimized three-dimensional (3D) conformers at the Density Functional Theory (DFT) level, using the B3LYP functional with the 6-31G* basis set [18]. The optimized structures in Spartan format were converted to SD file format and then imported into the PaDEL software to calculate the molecular descriptors.

Dataset normalization and pre-treatment

To give each descriptor an equal chance of influencing the model, the descriptor values were normalized using Eq. (2) [19]. The normalized data were pre-treated using the data pre-treatment software obtained from the Drug Theoretic and Cheminformatics Laboratory (DTC Lab) to remove empty columns and uninformative descriptors [20].

$$X = \frac{{X_{i} - X_{\min } }}{{X_{\max } - X_{\min } }}$$
(2)

where \(X_{i}\) is the value of each descriptor for a given molecule, and \(X_{\max }\) and \(X_{\min }\) are respectively the maximum and minimum values in each descriptor column.
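As a minimal sketch of Eq. (2), the normalization of a single descriptor column could be written as follows. The function name is hypothetical, and the handling of a constant column (returning zeros) is an assumption; such columns would normally be removed during pre-treatment.

```python
def min_max_normalize(column):
    """Scale one descriptor column to the range [0, 1] using Eq. (2)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant descriptor carries no information,
        # so it is mapped to zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


if __name__ == "__main__":
    # Illustrative descriptor values only.
    print(min_max_normalize([2.0, 4.0, 6.0]))
```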

Model generation and validation

In order to generate a good QSAR model, the pre-treated dataset was divided into training and test sets in the ratio 7:3 using the data division software of DTC Lab [20]. The model was built on the training set using the Genetic Function Approximation with Multiple Linear Regression (GFA-MLR) method in Materials Studio. The test set was then used to validate the built model [21]. The fitness of the generated model was assessed using the lack of fit (LOF) score [22], as in Eq. (3).

$${\text{LOF}} = \frac{{{\text{SEE}}}}{{\left( {1 - \frac{C + d*P}{M}} \right)^{2} }}$$
(3)

where SEE is the standard error of estimation, C is the number of terms in the model, d is a user-defined smoothing parameter, P is the total number of descriptors in the model and M is the number of compounds in the training set. SEE can be expressed as:

$${\text{SEE}} = \sqrt {\frac{{\sum \left( {Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} } \right)^{2} }}{N - P - 1}}$$
(4)

where \(Y_{{{\text{exp}}}}\) and \(Y_{{{\text{pred}}}}\) are respectively the experimental and predicted activities in the training set, the sum runs over the training set compounds and N is the number of compounds in the training set [22].
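Eqs. (3) and (4) can be sketched as below, assuming that the squared residuals in Eq. (4) are summed over the training set compounds. The function and argument names are hypothetical, and the numbers are illustrative only.

```python
import math


def standard_error_of_estimation(y_exp, y_pred, n_descriptors):
    """SEE as in Eq. (4): sqrt(sum of squared residuals / (N - P - 1))."""
    n = len(y_exp)
    dof = n - n_descriptors - 1
    if dof <= 0:
        raise ValueError("need more compounds than descriptors + 1")
    sse = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    return math.sqrt(sse / dof)


def lack_of_fit(see, c, d, p, m):
    """LOF as in Eq. (3): SEE / (1 - (C + d*P) / M)^2."""
    penalty = 1.0 - (c + d * p) / m
    if penalty == 0:
        raise ValueError("penalty term is zero; LOF is undefined")
    return see / penalty ** 2


if __name__ == "__main__":
    # Illustrative activities only (not values from the dataset).
    y_exp = [1.0, 2.0, 3.0, 4.0]
    y_pred = [1.1, 1.9, 3.2, 3.8]
    see = standard_error_of_estimation(y_exp, y_pred, n_descriptors=1)
    print(see, lack_of_fit(see, c=2, d=0.5, p=1, m=len(y_exp)))
```

Because the denominator shrinks as terms and descriptors are added, the LOF score penalizes larger models that fit the training set no better than smaller ones.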

The squared correlation coefficient (R2) measures how well the predicted activities match the experimental activities of the training set. The closer the R2 value is to 1, the better the fit of the model. R2 is expressed as:

$$R^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} )^{2} }}{{\sum (Y_{{{\text{exp}}}} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(5)

where \(Y_{{{\text{exp}}}}\), \(Y_{{{\text{pred}}}}\) and \(\overline{Y}_{{{\text{training}}}}\) are respectively the experimental activity, the predicted activity and the mean experimental activity of the samples in the training set. R2 alone cannot establish the validity of a model because it never decreases as descriptors are added; the adjusted R2, which penalizes the number of descriptors, is therefore more reliable. The adjusted R2 is given by:

$$R_{{{\text{adj}}}}^{2} = 1 - \frac{{(1 - R^{2} )(n - 1)}}{n - d - 1}$$
(6)

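where d is the number of descriptors in the model and n is the number of training set compounds. The predictive power of the model is determined by the cross-validation coefficient \((Q_{{{\text{cv}}}}^{2} )\) and the external validation coefficient \((R_{{{\text{test}}}}^{2} )\), expressed in Eqs. (7) and (8) respectively. In Eq. (7), \(Y_{{{\text{pred}}}}\) is the activity predicted for a training set compound during cross-validation.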

$$Q_{{{\text{cv}}}}^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{exp}}}} - Y_{{{\text{pred}}}} )^{2} }}{{\sum (Y_{{{\text{exp}}}} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(7)

$$R_{{{\text{test}}}}^{2} = 1 - \left[ {\frac{{\sum (Y_{{{\text{pred}}_{{{\text{test}}}} }} - Y_{{{\text{exp}}_{{{\text{test}}}} }} )^{2} }}{{\sum (Y_{{{\text{exp}}_{{{\text{test}}}} }} - \overline{Y}_{{{\text{training}}}} )^{2} }}} \right]$$
(8)

where \(Y_{{{\text{pred}}_{{{\text{test}}}} }}\) is the predicted activity, \(Y_{{{\text{exp}}_{{{\text{test}}}} }}\) is the experimental activity of the test set and \(\overline{Y}_{{{\text{training}}}}\) is the mean activity of the training set [21].
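The fit and validation statistics of this section can be sketched as follows. The sketch assumes the conventional definitions: the adjusted R2 is computed as 1 − (1 − R2)(n − 1)/(n − d − 1), and the denominator of the external R2 sums the squared deviations of the experimental test activities from the training set mean. Function names and numbers are hypothetical.

```python
def r_squared(y_exp, y_pred, y_mean_training=None):
    """R2-type statistic: 1 - SS_residual / SS_total.

    With training data and no explicit mean this is R2 (Eq. 5). With
    cross-validated predictions it is Q2cv (Eq. 7). With test data and
    the training set mean it is the external R2 (Eq. 8).
    """
    if y_mean_training is None:
        y_mean_training = sum(y_exp) / len(y_exp)
    ss_res = sum((e - p) ** 2 for e, p in zip(y_exp, y_pred))
    ss_tot = sum((e - y_mean_training) ** 2 for e in y_exp)
    if ss_tot == 0:
        raise ValueError("experimental activities have zero spread")
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, d):
    """Adjusted R2 for n training compounds and d descriptors."""
    if n - d - 1 <= 0:
        raise ValueError("need n > d + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - d - 1)


if __name__ == "__main__":
    # Illustrative activities only (not values from the dataset).
    y_train = [1.0, 2.0, 3.0, 4.0]
    y_fit = [1.1, 1.9, 3.2, 3.8]
    r2 = r_squared(y_train, y_fit)
    print(r2, adjusted_r_squared(r2, n=len(y_train), d=1))
    mean_train = sum(y_train) / len(y_train)
    print(r_squared([2.0, 3.5], [2.2, 3.4], y_mean_training=mean_train))
```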

Y-randomization

Y-randomization is a validation test that checks whether the model could have arisen from a chance correlation. New models are built from datasets in which the activity values are randomly shuffled while the descriptors are kept unchanged; for a good model, these random models perform poorly. The Y-randomization coefficient (\({\text{cR}}_{{\text{p}}}^{2}\)) must be greater than 0.5 for a good model, and is expressed as:

$${\text{cR}}_{{\text{p}}}^{2} = R \times \sqrt {R^{2} - (R_{{\text{r}}} )^{2} }$$
(9)

where \({\text{cR}}_{{\text{p}}}^{2}\) is the Y-randomization coefficient, R is the correlation coefficient of the original (non-randomized) model and \(R_{{\text{r}}}\) is the average ‘R’ of random models [19].
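A minimal sketch of the Y-randomization coefficient follows. It assumes the commonly used form in which the difference of squares is placed under a square root, cRp2 = R × sqrt(R2 − Rr2), and takes Rr as the average R of the random models as defined above. The function name and the R values are hypothetical.

```python
import math


def crp2(r_model, r_random_models):
    """Y-randomization coefficient: R * sqrt(R^2 - Rr^2).

    r_model is the correlation coefficient R of the original model and
    r_random_models holds the R values of the randomized models.
    """
    r_random = sum(r_random_models) / len(r_random_models)
    diff = r_model ** 2 - r_random ** 2
    if diff < 0:
        raise ValueError("random models fit better than the original model")
    return r_model * math.sqrt(diff)


if __name__ == "__main__":
    # Illustrative R values only (not the values reported in Table 4).
    value = crp2(0.9, [0.25, 0.30, 0.35])
    print(value, value > 0.5)
```

The coefficient approaches R2 of the original model when the random models have almost no correlation, and falls towards zero when the random models fit nearly as well as the original one.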

Applicability domain (AD)

Applicability domain is a theoretical region of the chemical space defined by the model descriptors, the model response and the nature of the training set. The leverage approach was employed to determine which compounds lie within the AD [23]; any compound that lies outside the AD is treated as an outlier. The leverage is calculated using Eq. (10).

$$l_{i} = X_{i} (X^{{\text{T}}} X)^{ - 1} X_{i}^{{\text{T}}}$$
(10)

where \(l_{i}\) is the leverage of each compound, \(X_{i}\) is the descriptor row-vector of the query compound i, and X is the (m × n) descriptor matrix of the training set compounds used in building the model. The critical value (l*) is defined by Eq. (11).

$$l^{*} = 3\frac{p + 1}{n}$$
(11)

where p is the number of descriptors in the model and n is the number of objects used to develop the model.
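Eqs. (10) and (11) can be sketched without external libraries as below. The helper names are hypothetical, the descriptor matrix is illustrative, and the small Gaussian-elimination solver stands in for the matrix inverse of Eq. (10).

```python
def solve(a, b):
    """Solve a.z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def leverages(x_train, x_query):
    """Leverage x_i (X^T X)^-1 x_i^T of each query row, as in Eq. (10)."""
    p = len(x_train[0])
    xtx = [[sum(row[i] * row[j] for row in x_train) for j in range(p)]
           for i in range(p)]
    return [sum(xi * zi for xi, zi in zip(x, solve(xtx, x))) for x in x_query]


def critical_leverage(p, n):
    """Warning leverage l* = 3(p + 1)/n, as in Eq. (11)."""
    return 3 * (p + 1) / n


if __name__ == "__main__":
    # Illustrative descriptor matrix only (3 compounds, 2 descriptors).
    x_train = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(leverages(x_train, x_train))
    print(critical_leverage(p=4, n=25))
```

Under Eq. (11), a model with four descriptors and a critical leverage of 0.6 would correspond to n = 25 training compounds; this is an inference from the reported value, not a figure stated in the text.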

Mean effect (ME) and variance inflation factor (VIF)

The mean effect is used to elucidate the relative importance of each descriptor in the model, while the VIF is used to measure the multicollinearity among the descriptors in the model. A VIF value of 1 shows no collinearity among the descriptors, and a value above 10 indicates an unstable model. The ME and VIF are respectively calculated using Eqs. (12) and (13).

$${\text{ME}}_{j} = \frac{{B_{j} \mathop \sum \nolimits_{i = 1}^{n} D_{ij} }}{{\mathop \sum \nolimits_{j = 1}^{m} \left( {B_{j} \mathop \sum \nolimits_{i = 1}^{n} D_{ij} } \right)}}$$
(12)

where \({\text{ME}}_{j}\) is the mean effect of descriptor j, \(B_{j}\) is the coefficient of descriptor j in the model, \(D_{ij}\) is the value of descriptor j for training set compound i, and m and n are respectively the number of descriptors that appear in the model and the number of molecules in the training set.

$${\text{VIF}} = \frac{1}{{1 - R^{2} }}$$
(13)

where R2 is the squared multiple correlation coefficient obtained by regressing one descriptor on the other descriptors in the model [24].
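Eqs. (12) and (13) can be sketched as below. The function names, coefficients and descriptor values are hypothetical; the R2 passed to the VIF function is assumed to come from regressing one descriptor on the remaining descriptors.

```python
def mean_effects(coefficients, descriptor_columns):
    """Mean effect of each descriptor, as in Eq. (12).

    coefficients[j] is B_j and descriptor_columns[j] holds the training
    set values of descriptor j.
    """
    weighted = [b * sum(col) for b, col in zip(coefficients, descriptor_columns)]
    total = sum(weighted)
    if total == 0:
        raise ValueError("mean effects are undefined when the total is zero")
    return [w / total for w in weighted]


def variance_inflation_factor(r2_j):
    """VIF = 1 / (1 - R2), as in Eq. (13)."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("R2 must lie in [0, 1)")
    return 1.0 / (1.0 - r2_j)


if __name__ == "__main__":
    # Illustrative coefficients and descriptor values only.
    print(mean_effects([2.0, -1.0], [[1.0, 2.0, 3.0], [1.0, 1.0, 2.0]]))
    print(variance_inflation_factor(0.0), variance_inflation_factor(0.5))
```

By construction the mean effects of all descriptors in a model sum to 1, so each value can be read as a signed share of the total contribution.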

Molecular design

An in silico template-based design approach was employed to design new compounds with enhanced activity against breast cancer. This method has frequently been used to screen and model compounds with improved activity by relating the experimental activities of the compounds to their structures [25]. Accordingly, the compound with the highest activity was taken as the template for designing new compounds with enhanced activities.

Results

All the tables and figures that describe the outcome of the built model and the designed compounds are presented in this section.

Discussion

All thirty-four compounds used in this study were first sketched in ChemDraw to obtain their 2D structures and then imported into the Spartan 14 software to obtain their optimized 3D structures. The molecular descriptors of the optimized structures were calculated with the PaDEL descriptor software, and the resulting dataset was normalized and pre-treated. A total of 1874 molecular descriptors, which encode the important features of the structures, were calculated.

The 2D structures and activities of the studied compounds are presented in Table 1. Genetic Function Approximation (GFA) was used to generate four models. The first of the four was selected as the optimum model since it best satisfies the minimum criteria for a good QSAR model, reported in Table 2. Table 3 displays the validation parameters for the generated models. Table 4 presents the Y-randomization test used to affirm the strength of the model. This test was carried out on the training set by keeping the independent variables constant and randomizing the dependent variable. The low values of R, R2 and Q2 for the randomized models indicate the robustness of the generated model, and the Y-randomization coefficient (\({\text{cR}}_{{\text{p}}}^{2}\) = 0.8200) confirms that the generated model was not obtained by chance.

Table 1 2D structure and names of the dataset and their 50% growth inhibitory activities in mmol L−1
Table 2 Standard validation parameters for a good QSAR model
Table 3 QSAR model validations values
Table 4 Y-randomization

Table 5 displays the correlation matrix, VIF and ME of the four descriptors used to build the model. The low values of the Pearson correlation coefficients indicate that there is no significant correlation between the descriptors, which means that each descriptor contributes different information to the model. Multicollinearity among the descriptors was assessed with the Variance Inflation Factor (VIF); since the VIF values were all less than 2, the descriptors in the model are sufficiently independent and the model is statistically satisfactory [24]. The descriptor maxHBd, which has the highest positive ME value, is the most prominent descriptor in the model and was therefore made the focal point when designing new compounds. maxHBd is the maximum E-state for (strong) hydrogen bond donors.

Table 5 Correlation matrix, VIF and mean effect (ME) for the QSAR model descriptors

Descriptive analysis was carried out to confirm that the dataset was well divided into training and test sets. Table 6 shows that the maximum, minimum and standard deviation values of the training and test sets are very close, suggesting no significant difference between them. We therefore deduce that the test set lies within the range covered by the training set, which confirms the suitability of the Kennard-Stone method employed in the data division.

Table 6 Descriptive analysis
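The Kennard and Stone data division referred to in the discussion of Table 6 can be illustrated with a minimal sketch. It assumes Euclidean distances between (normalized) descriptor vectors and the usual max-min selection rule; the function name and the data are hypothetical, and the DTC Lab software may differ in its details.

```python
import math


def kennard_stone(points, n_train):
    """Return (train_indices, test_indices) chosen by Kennard-Stone.

    The two most distant points seed the training set. After that, the
    point whose nearest selected neighbour is farthest away is added,
    until n_train points have been selected.
    """
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of points")
    dist = [[math.dist(a, b) for b in points] for a in points]
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda ij: dist[ij[0]][ij[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected, remaining


if __name__ == "__main__":
    # Illustrative one-descriptor data only.
    train, test = kennard_stone([[0.0], [1.0], [2.0], [10.0]], n_train=3)
    print(sorted(train), test)
```

Because the training set is built from the extremes inward, the compounds left for the test set tend to lie inside the descriptor range spanned by the training set.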

Table 7 presents the details of the descriptors used to build the model. The first two descriptors are 2D and the last two are 3D. Comparison of the equations generated by the Materials Studio software, displayed below, with the standard validation parameters for a good QSAR model in Table 2 indicates that Model 1 is the best model.

Table 7 Details of the descriptors used in the model

Model 1


pGI50 = 0.709363893 * GATS8c − 4.252846824 * maxHBd − 0.063150018 * TDB10p − 0.153565552 * RNCS + 4.211504042;


Model 2


pGI50 = 0.772397640 * GATS8c − 4.191643628 * maxHBd − 0.000418849 * TDB10v − 0.153328698 * RNCS + 4.006452472;


Model 3


pGI50 = 1.245271529 * MATS7c − 0.826297651 * SpMax5_Bhi − 3.598436568 * maxHBd − 0.135122003 * RNCS + 6.680738880;


Model 4


pGI50 = 0.723219873 * GATS8c − 4.266084558 * maxHBd − 0.061174372 * Kier1 − 0.155935615 * RNCS + 4.726811020.
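As an illustration of how Model 1 is applied, the sketch below evaluates the equation for one compound. The coefficients are those of Model 1; the function name and the descriptor values are hypothetical, and the inputs are assumed to be on the same scale as the descriptors used to build the model.

```python
# Coefficients and intercept of Model 1 as listed above.
MODEL_1 = {
    "GATS8c": 0.709363893,
    "maxHBd": -4.252846824,
    "TDB10p": -0.063150018,
    "RNCS": -0.153565552,
}
INTERCEPT = 4.211504042


def predict_pgi50(descriptors):
    """Predicted pGI50 = intercept + sum of coefficient * descriptor value."""
    missing = set(MODEL_1) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(coef * descriptors[name] for name, coef in MODEL_1.items())


if __name__ == "__main__":
    # Hypothetical descriptor values, for illustration only.
    example = {"GATS8c": 1.0, "maxHBd": 0.1, "TDB10p": 2.0, "RNCS": 0.5}
    print(round(predict_pgi50(example), 4))
```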

The residual activity is the difference between the predicted and reported activities, and is presented in Table 8. The low residual values indicate that the predicted activities are close to the experimental activities, which accounts for the high predictive power of the model. Figures 1 and 2 show the plots of experimental activity against predicted activity for the training and test sets respectively; the R2 values of the two plots are satisfactory when compared with the recommended R2 value for a good QSAR model reported in Table 2. The plot of standardized residuals versus experimental activity in Fig. 3 was used to check for systematic error in the built model. The model was found to be free of systematic error since all the standardized residuals lie within ±2 units. Figure 4 shows the Williams plot, which helps to identify compounds that are either influential or outliers. Four compounds were found to be outliers because their leverage values were greater than the critical leverage (l* = 0.6); these compounds were not considered when designing new anti-breast cancer agents.

Table 8 Residual values for both training and test dataset
Fig. 1 Plot of experimental activity against predicted activity of the training set

Fig. 2 Plot of experimental activity against predicted activity of the test set

Fig. 3 The plot of standardized residual against experimental activity

Fig. 4 The Williams plot

In order to design more potent anti-breast cancer compounds, compound 10 (Fig. 5), which has the highest reported activity (pGI50 = 4.0458), was adopted as the template. The most influential descriptor, maxHBd (maximum E-state for hydrogen bond donors), with a mean effect of 0.8382, was investigated. To raise the hydrogen bond donor character, H-bond acceptors and strongly electronegative atoms (F, O and N) were attached at appropriate positions, which led to the design of six new compounds with enhanced 50% growth inhibitory activity, as displayed in Table 9.

Fig. 5 Template compound (C10), 2-(3-(oxazol-2-yl)-4-oxo-4,5,6,7-tetrahydro-3H-cyclopenta[4,5]thieno[2,3-d]pyrimidin-2-yl)acetonitrile, for the designed compounds

Table 9 2D structures and 50% growth inhibitory activities (pGI50) of the designed compounds

Conclusion

This research built a QSAR model with high predictive power, using the descriptors maxHBd, GATS8c, TDB10p and RNCS. The Williams plot identified four compounds as outliers that should not be considered for further computational study. The validation parameters of the model, as discussed above, all satisfied the minimum recommendations for a valid QSAR model. The descriptor maxHBd, with a positive mean effect value of 0.8382, was found to have the greatest influence on the optimum model and guided the design of six new compounds from the template compound. Three of the six designed compounds were found to have pGI50 values (4.2118, 4.1688 and 4.2504) greater than those of the template and the rest of the designed compounds. In conclusion, the research aim was achieved, and the results of this work could serve as first-hand information for pharmaceutical chemists, pharmacists and pharmacologists in the course of producing new drugs against breast cancer.