1 Introduction

After heart disease, cancer is the second leading cause of mortality worldwide; in 2018 about 18.1 million cases were diagnosed. Of all cancers, mammary tumor (breast cancer) is frequent among women and is the second leading cause of disease-related death among them. Despite the collective effort towards conquering this disease, it remains a major global challenge [7]. Survival of mammary tumor differs widely across the world, with 5-year survival of approximately 80% in developed countries and below 40% in developing countries. Developing countries contend with resource and infrastructure limitations that hinder successful mammary tumor outcomes through early detection, diagnosis and management [4].

Computer-aided drug discovery and design (CADD) helps identify the best possible lead compound, reduces the cost of discovering a drug, and shortens the time the drug takes to pass through the subsequent stages before it is ready for use. It is a fundamental approach in drug discovery. CADD techniques identify lead molecules by assessing and predicting potency and probable side effects, and they also assist in improving the drug-likeness of compounds [11]. Drug compounds with poor drug-likeness and ADMET (absorption, distribution, metabolism, excretion and toxicity) properties will not progress to pre-clinical research, irrespective of high biological activity. ADMET is one of the main properties used in analyzing a drug compound, and significant progress has been made since such properties recently began to receive more consideration [17].

The incidence of breast cancer has increased, and it remains the most substantial cause of mortality among women. Despite the headway made in managing breast cancer, the search for a curative treatment is ongoing because the tumor often becomes resistant to treatment within a short time. Although a number of crucial studies and clinical trials have contributed significantly to improving mammary tumor care, many cancer cases and pathways remain unknown to the majority of clinicians [19]. Recently, [14] reported the anti-proliferative activities of some novel Parvifloron derivatives against the MCF-7 cell line. This study aims to build a mathematical QSAR model, to design new Parvifloron compounds based on the derived QSAR model, and to ascertain the pharmacokinetic properties of the newly designed compounds.

Luminal-type breast cancer (MCF-7) is the estrogen receptor (ER)/progesterone receptor (PR)-positive type, which is caused by over-expression of estrogen receptor α (ERα). It accounts for about 70% of mammary tumor patients, who are classed as ER-positive (ER+). The constant activation of ERα by estrogens induces the proliferation of cancer cells [12].

2 Methodology

2.1 Computational Information

2.1.1 Hardware and Software

The computer used in this research was a 7th-generation HP Pavilion with an Intel Core i7-7500U processor and 12.00 GB RAM, running the Windows 10 operating system. The software used to carry out this research includes Spartan’14 (version 1.1.2), Materials Studio (V8), AutoDock visualizer version 4.2, PyRx software, ChemDraw software version 12.0.2, PaDEL-Descriptor software V2.20, DTC Lab software and Microsoft Office Excel 2013.

2.2 QSAR Assessment

2.2.1 Data Gathering

Twenty-six (26) novel Parvifloron derivatives, with their anti-proliferative activities against the breast cancer (MCF-7) cell line reported as inhibitory concentration (IC50), were obtained from [14].

2.2.2 Anti-proliferative Activities and Geometry Optimization

The IC50 values were normalized to pIC50 using the logarithmic scale \(\mathrm{pIC}_{50} = -\log_{10}(\mathrm{IC}_{50} \times 10^{-6})\), where IC50 is measured in micromolar (µM). The anti-proliferative activities (IC50) and pIC50 values of the derivatives are shown in Table 1. QSAR analysis requires careful attention at every step. At the start, drawing the structure is a crucial step for the calculation of the molecular descriptors, which serve as the independent variables. In this research, ChemDraw V (12.0.2) was used to draw the Parvifloron derivatives, which were then converted to 3D format and geometrically optimized in Spartan 14 V (1.1.4) using Density Functional Theory (DFT) with the B3LYP functional and the 6-311G basis set [3]. The aim of optimization is to obtain a 3-dimensional structure that is very close to the actual 3-dimensional molecular structure, so that the molecular parameters represent the main physicochemical properties of the molecule well [15].
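The conversion above can be illustrated with a minimal sketch. The function name `pic50_from_ic50_um` and the sample IC50 values are hypothetical and are not taken from Table 1; the only thing assumed from the text is that IC50 is given in µM.

```python
import math


def pic50_from_ic50_um(ic50_um: float) -> float:
    """Convert an IC50 given in micromolar to pIC50 = -log10(IC50 in molar)."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be a positive concentration")
    # 1 uM = 1e-6 M, which is the 10^-6 factor in the formula
    return -math.log10(ic50_um * 1e-6)


# Illustrative inputs only (not experimental data from this study)
for ic50 in (0.5, 1.0, 10.0):
    print(f"IC50 = {ic50} uM -> pIC50 = {pic50_from_ic50_um(ic50):.3f}")
```

A lower IC50 therefore maps to a higher pIC50, which is why the more potent compounds carry the larger activity values in the model.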

Table 1 Parvifloron derivatives and their activities

2.2.3 Molecular Descriptors Calculations and Pretreatment

The 26 Parvifloron derivatives were converted to SDF format after optimization. PaDEL-Descriptor (Pharmaceutical Data Exploration Laboratory) software V (2.20) was used to calculate the physicochemical descriptors [18]. The descriptors were pretreated using the Data Pre-treatment software GUI 1.2 [1] to remove uninformative descriptors.

2.2.4 Division of Data Set and Model Building

The Kennard-Stone algorithm [13] was used to divide the derivatives into a training (calibration) set and a test (validation) set [2]. The training set is used to develop a calibrated model, which is then used to predict the bio-activities of the test-set molecules. Version 8 of Materials Studio was used to construct the mathematical model with the Genetic Function Approximation (GFA) technique. The dependent variable is the anti-proliferative activity (pIC50), and the independent variables are the model parameters (descriptors) obtained using PaDEL-Descriptor software V (2.20).
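As a rough illustration of the data division, the sketch below implements the classic Kennard-Stone selection with NumPy. It is not the software used in this study: the function name, the random descriptor matrix, the use of unscaled Euclidean distance and the 18/8 split are all assumptions made for the example.

```python
import numpy as np


def kennard_stone(X, n_train):
    """Return the row indices of n_train samples chosen by Kennard-Stone."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and the number of samples")
    # Pairwise Euclidean distances between all compounds
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Start with the two most distant compounds
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_train:
        # Distance from each remaining compound to its nearest selected one
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the compound that is farthest from everything selected so far
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected


# Hypothetical descriptor matrix: 26 compounds x 4 descriptors
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(26, 4))
train_idx = kennard_stone(descriptors, n_train=18)
test_idx = [k for k in range(26) if k not in train_idx]
print(sorted(train_idx), test_idx)
```

Because the algorithm always adds the compound farthest from those already chosen, the training set spans the descriptor space and the test compounds tend to lie inside it. In practice the descriptors are usually scaled before distances are computed.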

2.2.5 Model Validation (Internal)

Internal validation uses the compounds from which the model was generated and checks the model's internal robustness. Cross-validation (CV) is commonly used as the internal validation technique for the derived model. One compound is removed from the training set, and the remaining n − 1 molecules (n = the total number of training molecules) are used to build the model. The anti-proliferative activity of the removed compound is then calculated. The procedure is repeated n times, once for every molecule, so that every molecule has a calculated activity [6]. This procedure is known as the leave-one-out (LOO) technique. The cross-validated coefficient is given as:

$$ Q^{2}_{\text{cv}} = 1 - \left[ \frac{\sum \left( Y_{pred} - Y_{exp} \right)^{2}}{\sum \left( Y_{exp} - Y_{training} \right)^{2}} \right] $$

Ytraining, Yexp and Ypred are, respectively, the mean experimental activity (pIC50) of the training set, the experimental activity, and the predicted activity of the training-set compounds [5]. The squared correlation coefficient R2 of the fitted model is given as:

$$ {\text{R}}^{{2}} \, = \,{1} - \left[ {\frac{{\sum \left( {Y_{exp} - Y_{pred} } \right)^{2} }}{{\sum \left( {Y_{exp} - Y_{training} } \right)^{2} }}} \right] $$

where Yexp and Ypred are the experimental and predicted activities of the training-set compounds, and Ytraining is the mean experimental activity of the training set [16]. It is used to estimate the predictive power of the statistical model obtained from a regression technique.
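The two statistics above can be computed as in the following minimal sketch. Ordinary least-squares regression is used here as a stand-in for the GFA model built in Materials Studio, and the data are synthetic; the function names and the 18 × 4 matrix are assumptions for illustration only.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_mlr(coef, X):
    """Predict activities from descriptors using fitted coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return A @ coef


def r2_and_q2_loo(X, y):
    """Return (R2, Q2cv) using the training-set mean in both denominators."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    ss_tot = ((y - y.mean()) ** 2).sum()
    # R2: residuals of the model fitted on all training compounds
    y_fit = predict_mlr(fit_mlr(X, y), X)
    r2 = 1.0 - ((y - y_fit) ** 2).sum() / ss_tot
    # Q2cv: each compound is predicted by a model built without it
    y_loo = np.empty_like(y)
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        coef = fit_mlr(X[keep], y[keep])
        y_loo[i] = predict_mlr(coef, X[i:i + 1])[0]
    q2 = 1.0 - ((y - y_loo) ** 2).sum() / ss_tot
    return r2, q2


# Synthetic training data (not the descriptors or activities of this study)
rng = np.random.default_rng(1)
X_syn = rng.normal(size=(18, 4))
y_syn = 5.5 + X_syn @ np.array([0.2, 1.1, -1.7, 0.6]) + rng.normal(scale=0.3, size=18)
r2_syn, q2_syn = r2_and_q2_loo(X_syn, y_syn)
print(f"R2 = {r2_syn:.4f}, Q2cv = {q2_syn:.4f}")
```

For a least-squares model, Q2cv can never exceed R2, so a small gap between the two is the usual sign that the model is not over-fitted.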

2.2.6 Model Validation (External)

A model with an excellent fit and an acceptable internal prediction can still fail to capture the actual relationship between the predictor variables (model descriptors) and the response variable (bio-activity). The predictive ability of the built model (equation) is therefore analyzed by external validation, which measures how well the model predicts compounds that were not used to build it. The criteria proposed by Golbraikh and Tropsha for an effective model with good predictive power are as follows:

  • a. R2pred > 0.6

  • b. \(\frac{{r}^{2}-{r}_{o}^{2}}{{r}^{2}} < 0.1\)

  • c. 0.85 < k < 1.15 or 0.85 < k′ < 1.15.

where r2 is the squared correlation coefficient between the actual and calculated activities, \({r}_{o}^{2}\) is the squared correlation coefficient between the actual and calculated activities for the regression through the origin, and k and k′ are the slopes of the regression lines passing through the origin [2].
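The three criteria can be checked with a short sketch such as the one below. The formulas for R2pred, k, k′ and r0² follow one common convention (observed regressed on predicted through the origin); conventions differ between authors, so this is an assumption and not necessarily the exact calculation behind Table 6. The function name and the sample activities are hypothetical.

```python
import numpy as np


def golbraikh_tropsha(y_obs, y_pred, y_train_mean):
    """Evaluate the Golbraikh-Tropsha criteria for a test set."""
    y = np.asarray(y_obs, dtype=float)
    p = np.asarray(y_pred, dtype=float)
    # Predictive R2 relative to the mean activity of the training set
    r2_pred = 1.0 - ((y - p) ** 2).sum() / ((y - y_train_mean) ** 2).sum()
    # Squared Pearson correlation between observed and predicted
    r2 = np.corrcoef(y, p)[0, 1] ** 2
    # Slopes of the regression lines forced through the origin
    k = (y * p).sum() / (p ** 2).sum()        # observed = k * predicted
    k_prime = (y * p).sum() / (y ** 2).sum()  # predicted = k' * observed
    # r0^2 for the line observed = k * predicted
    r0_2 = 1.0 - ((y - k * p) ** 2).sum() / ((y - y.mean()) ** 2).sum()
    passes = (
        r2_pred > 0.6
        and (r2 - r0_2) / r2 < 0.1
        and (0.85 <= k <= 1.15 or 0.85 <= k_prime <= 1.15)
    )
    return {"r2_pred": r2_pred, "r2": r2, "r0_2": r0_2,
            "k": k, "k_prime": k_prime, "passes": passes}


# Hypothetical test-set activities, for illustration only
observed = [5.0, 5.5, 6.0, 6.5]
predicted = [5.1, 5.4, 6.1, 6.4]
print(golbraikh_tropsha(observed, predicted, y_train_mean=5.6))
```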

2.2.7 QSAR Applicability Domain of Model

The goal of applicability domain methods is to estimate the reliability of a generated model for each individual compound [8]. A model should only be applied within the domain of its training set, so it is essential to assess whether the compounds fall within that domain. The applicability domain is evaluated using the leverage value of every molecule. The leverage (L) defines the applicability domain of the generated equation [20]. It is formulated as:

$$ L_{i} = x_{i} \left( X^{T} X \right)^{-1} x_{i}^{T} \quad (i = 1, \ldots, P). $$

where X is the n × k matrix of training-set descriptors used in constructing the model, XT is the transpose of X, and xi is the descriptor row vector of compound i. The warning leverage (H*) is the threshold used to check for outliers. It is written as:

$$ H*\, = \,\frac{{3\left( {p + 1} \right)}}{m} $$

where p is the total number of structural descriptors and m is the total number of training-set compounds. The Williams plot (a plot of standardized residuals versus leverage values) is drawn for both the training (calibration) and test (validation) sets. Molecules that fall within the warning leverage on the plot are considered to be reliably predicted.
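The leverage and warning-leverage formulas above can be sketched as follows. The function names and the random descriptor matrix are hypothetical. The training-set size of 18 is an assumption, not a figure stated in the text; it is used only because, with the four descriptors of the model, it gives 3(4 + 1)/18 ≈ 0.833, the value quoted in the discussion of the Williams plot.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage x_i (X^T X)^-1 x_i^T for every row x_i of X_query."""
    Xt = np.asarray(X_train, dtype=float)
    Xq = np.asarray(X_query, dtype=float)
    # Pseudo-inverse keeps the sketch stable if descriptors are collinear
    xtx_inv = np.linalg.pinv(Xt.T @ Xt)
    return np.einsum("ij,jk,ik->i", Xq, xtx_inv, Xq)


def warning_leverage(p, m):
    """Warning leverage H* = 3(p + 1) / m."""
    return 3.0 * (p + 1) / m


# Hypothetical training descriptors: 18 compounds x 4 descriptors
rng = np.random.default_rng(2)
X_train_demo = rng.normal(size=(18, 4))
h_train = leverages(X_train_demo, X_train_demo)
h_star = warning_leverage(p=4, m=18)
print(f"H* = {h_star:.3f}; compounds above H*: {int((h_train > h_star).sum())}")
```

A compound whose leverage exceeds H* lies far from the training descriptors, so its predicted activity is an extrapolation and should be treated with caution.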

2.2.8 Computational Pharmacokinetics (Drug-Likeness)

SwissADME was used to analyze the drug-likeness of the newly designed compounds. Furthermore, the designed compounds were checked for compliance with Lipinski's rule of five [10], a widely used criterion for judging whether a compound can be orally absorbed: molecular weight (MW) ≤ 500, octanol/water partition coefficient (AlogP) ≤ 5, number of hydrogen bond donors (HBDs) ≤ 5 and number of hydrogen bond acceptors (HBAs) ≤ 10. According to the rule of five, a drug compound is unlikely to be orally active if it violates two or more of the four rules [9].
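The rule-of-five check described above amounts to counting violations, as in this minimal sketch. The class name, function names and property values are hypothetical; in the study the properties themselves come from SwissADME, which is not called here.

```python
from dataclasses import dataclass


@dataclass
class CompoundProperties:
    """Physicochemical properties needed for Lipinski's rule of five."""
    mw: float    # molecular weight
    logp: float  # octanol/water partition coefficient
    hbd: int     # hydrogen bond donors
    hba: int     # hydrogen bond acceptors


def lipinski_violations(props: CompoundProperties) -> int:
    """Count how many of the four rules are violated."""
    return sum([
        props.mw > 500,
        props.logp > 5,
        props.hbd > 5,
        props.hba > 10,
    ])


def passes_rule_of_five(props: CompoundProperties) -> bool:
    """A compound fails only if it violates two or more rules."""
    return lipinski_violations(props) < 2


# Illustrative values only, not properties of the designed compounds
example = CompoundProperties(mw=372.5, logp=3.8, hbd=2, hba=5)
print(lipinski_violations(example), passes_rule_of_five(example))
```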

3 Results and Discussion

3.1 In Silico QSAR Investigation

An in silico QSAR investigation was used to derive a simple mathematical equation for calculating the anti-proliferative activities of Parvifloron derivatives from their structures. The investigation correlated the molecular descriptors (model parameters) with the bio-activities of the 26 derivative compounds using statistical techniques. Using the Genetic Function Approximation (GFA) technique, four QSAR models were generated to predict the anti-proliferative activities of the Parvifloron derivatives. Model 1 passed both internal and external validation, with a squared correlation coefficient (R2) of 0.9444, an adjusted squared correlation coefficient (R2adj) of 0.9273 and a cross-validation coefficient (Q2) of 0.8945. The external validation coefficient (R2pred) of 0.6214 for model 1 was calculated using the model descriptors of the test set, as shown in Tables 2 and 3. The robustness of the QSAR model was assessed using the reliability of the training set and the predicted pIC50 of the test set, which agrees with the criterion proposed by Golbraikh and Tropsha (R2pred > 0.6) for an effective QSAR model, as shown in Table 6.
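The relationship between the reported R2 and R2adj can be checked with the standard adjusted-R2 formula, sketched below. The formula itself is the usual textbook definition rather than one given in this paper, and the training-set size n = 18 is an assumption: it is not stated in the text, but with the four descriptors of model 1 it reproduces the reported R2adj.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R2 for a model with p descriptors fitted on n compounds."""
    if n - p - 1 <= 0:
        raise ValueError("need more compounds than descriptors plus one")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Reported R2 of model 1, four descriptors, assumed 18 training compounds
print(f"{adjusted_r2(0.9444, n=18, p=4):.4f}")
```

The adjustment penalizes each added descriptor, so an R2adj close to R2 suggests that the four descriptors are not merely inflating the fit.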

Table 2 External validation of model 1
Table 3 Continuation of descriptor calculation used in external validation of model 1

Model 1

Y = 0.157509087 × nX + 110477157080 × MATS3e − 1.703000586 × GATS5e + 0.574341593 × MLFER_BO + 5.589384.

The robustness of the QSAR model was assessed by the reliability of the calibration set and the calculated pIC50 of the validation set. The experimental, predicted and residual values of the Parvifloron derivatives are shown in Table 4. The residual is the difference between the experimental anti-proliferative activity and the calculated activity, and the low residual values indicate the high predictive power of the model. Both internal and external validation confirm that model 1 is stable and effective.

Table 4 The bio-activities (pIC50), calculated activities and residual values of model 1

Table 5 defines the model parameters (descriptors) of the calculated model. The descriptors were used to verify the model both internally and externally, and they were calculated using PaDEL-Descriptor software V2.20 (Abdullahi et al. [2]).

Table 5 Definition of descriptors and their classes for the model

The effectiveness and predictive power of the generated model were assessed using internal and external validation analysis. The model met the minimum accepted values for a QSAR model, indicating that it can be used to design new Parvifloron derivatives with better anti-breast cancer activity, as seen in Table 6.

Table 6 Golbraikh and Tropsha approved QSAR model standards

Statistical analysis, namely the mean effect and the VIF (Variance Inflation Factor), was used to evaluate the individual contribution of each molecular descriptor in the QSAR model. The sign of the mean effect indicates whether increasing a descriptor increases or decreases the activity. Therefore, increasing nX, GATS5e and MLFER_BO would increase the bio-activities of the derivative compounds (positive coefficient), while decreasing MATS3e would also increase the bio-activities of the derivative compounds (negative coefficient), as shown in Table 7. The VIF measures the degree of inter-correlation among the model parameters. The VIF scores were within the accepted range of 1–5, indicating that there is no collinearity among the descriptors of the constructed model, as shown in Table 7.
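Both statistics can be computed from the descriptor matrix, as in the sketch below. The paper does not give either formula, so two common definitions are assumed here: VIF as the diagonal of the inverse descriptor correlation matrix, which equals 1/(1 − R²) for each descriptor regressed on the others, and mean effect as each descriptor's coefficient-weighted sum divided by the total over all descriptors. The function names and input values are hypothetical.

```python
import numpy as np


def vif(X):
    """Variance inflation factor of each descriptor (column of X)."""
    X = np.asarray(X, dtype=float)
    corr = np.corrcoef(X, rowvar=False)
    # Diagonal of the inverse correlation matrix equals 1 / (1 - R_j^2)
    return np.diag(np.linalg.inv(corr))


def mean_effect(X, coefficients):
    """Relative contribution of each descriptor to the model (sums to 1)."""
    X = np.asarray(X, dtype=float)
    b = np.asarray(coefficients, dtype=float)
    contribution = b * X.sum(axis=0)
    return contribution / contribution.sum()


# Hypothetical descriptor values for five compounds and two descriptors
X_demo = np.array([[1.0, 0.2], [2.0, 0.9], [3.0, 0.1], [4.0, 0.8], [5.0, 0.4]])
print(vif(X_demo))
print(mean_effect(X_demo, [0.5, 1.0]))
```

A VIF of 1 means a descriptor is uncorrelated with the others, while values above about 5 suggest that two descriptors carry largely the same information.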

Table 7 Statistical analysis of model 1 parameters

Figure 1 shows a graph of the observed activities against the calculated activities for both the test-set and training-set compounds. The plot shows that the predicted activities are in good agreement with the experimental values, as shown in Table 2, confirming the effectiveness and stability of the generated model.

Fig. 1 Plot of predicted activities versus inhibition concentration

Figure 2 shows the standardized residuals of both the test and training sets plotted against the anti-proliferative activity (experimental activity). The values are spread on both sides of zero, showing that the model has no systematic error.

Fig. 2 A graph of standardized residuals against bioactivities (experimental activities)

Figure 3 shows the standardized residuals against the leverage values, also called the Williams plot. Most of the compounds fell within the applicability domain defined by the calculated warning leverage (H* = 0.833). Only 3 compounds were found outside the applicability domain, which might be due to slight differences in their molecular structures compared with the other molecules in the data set.

Fig. 3 The Williams plot

3.2 Ligand-Based Drug Design

Eight (8) new Parvifloron derivatives were designed using the ligand-based approach. The lead compounds (4 and 16) were chosen because of their low residual values and high pIC50 values, as shown in Table 4. This approach uses the molecular descriptors of the mathematical QSAR model, and the lead compounds (4 and 16) were modified according to the definitions of those descriptors, as shown in Table 5. The descriptor nX has a positive coefficient, which means adding halogen atoms such as F, Cl, Br or I at different structural positions. GATS5e also has a positive coefficient, which means adding electronegative groups such as OH or OCH3. The newly designed compounds and their calculated activities are presented in Table 8.

Table 8 Newly designed Parvifloron derivative compounds with their new predicted activities (pIC50)

3.3 Physicochemical and ADME Properties (Pharmacokinetics) of Designed Parvifloron Compounds

Many designed compounds fail to become drugs. Efficacy and safety are the main causes of drug failure, which indicates that the absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of compounds play a major role in every step of the drug discovery pipeline. It is therefore essential to discover potent compounds with effective ADMET properties (Guan et al. [9]). All the newly designed compounds were assessed for their drug-likeness (ADME and physicochemical properties). None of the designed compounds violated two or more of the rules in Lipinski's rule of five, a prominent principle used to assess the drug-likeness of a compound. This shows that all the designed compounds passed the drug-likeness test, as shown in Table 9, making them promising candidates for further development against ER-positive (MCF-7) breast cancer. Figure 4 shows the bioavailability radar for molecules 1 and 6. The bioavailability radar gives an initial view of the drug-likeness of a compound.

Table 9 Physicochemical and ADME properties (Pharmacokinetics) of designed Parvifloron derivative compounds against MCF-7 cell line
Fig. 4 The bioavailability radar for molecules 1 and 6

4 Conclusion

Parvifloron derivatives were shown to be promising anti-breast cancer drug candidates against the MCF-7 cell line through QSAR studies and pharmacokinetic analysis. The statistical analysis of the mathematical model obtained from the QSAR studies showed that increasing the nX, GATS5e and MLFER_BO descriptors will increase the anti-proliferative activities of Parvifloron derivatives, while decreasing MATS3e would also increase their anti-proliferative activities. The effectiveness and predictive power of the generated model were assessed using internal and external validation analysis. The model met the minimum accepted values, indicating that it can be used to design new Parvifloron derivatives with better anti-breast cancer activity. The molecular descriptors nX and GATS5e were the most significant and, based on their mean effects, the fragments of the lead compounds (4 and 16) were modified to design eight new Parvifloron derivatives with higher predicted activity against the MCF-7 cell line.

Furthermore, the pharmacokinetic analysis (drug-likeness test) carried out on the newly designed Parvifloron compounds revealed that all the compounds passed the drug-likeness test (ADME and other physicochemical properties) and adhered to Lipinski's rule of five, a criterion used in evaluating the drug-likeness of compounds. This suggests that the compounds can move on to the next step of pre-clinical evaluation as potential treatments for breast cancer (MCF-7 cell line).