Introduction

The efficacy of radiotherapy, one of the most common modalities used to combat cancer [1], is impaired by the fact that solid tumour cells suffer from hypoxia, which makes them radioresistant [2]. To overcome this therapeutically undesirable effect, radiosensitisation is necessary. Several classes of cellular radiosensitisers have been investigated so far [3], of which thymine analogues such as 5-bromouracil (BrU) and 5-iodouracil (IU) [4] and other halogenated nucleobases [5, 6] seem to be especially promising since they work under hypoxia. Their radiosensitising action is related to the irreversible and swift elimination of a halide anion (X⁻) from the modified nucleobase anion [7, 8] that is formed upon attachment of a solvated electron (ehyd). Solvated electrons, the second most abundant product of water radiolysis after hydroxyl radicals [9], are amply stable under hypoxia, but native DNA is not sensitive to them [5]. The situation is different for DNA with an incorporated radiosensitiser, where the elimination of X⁻ leaves a reactive nucleoside radical in the biopolymer. As a consequence, secondary reactions comprising hydrogen atom transfer within the radical nucleoside [10, 11] or between the radical and an adjacent nucleoside [12] ultimately lead to serious DNA damage such as single-strand breaks [13], cyclopurine lesions [14] and DNA cross-links [15].

Despite the abovementioned advantages of halogenated nucleosides, they are not routinely employed in anticancer treatment. Even 5-bromodeoxyuridine (BrdU) and 5-iododeoxyuridine (IdU), the derivatives most thoroughly investigated in clinical trials, were not introduced into practice because of only marginal therapeutic effects in patients [16, 17]. One probable reason that inhibits the development of this otherwise tempting approach to anticancer treatment is the lack of electrophilic nucleosides with appropriately optimised features. Recently, we have attempted the rational design of 5-substituted uracils prone to the dissociative electron attachment (DEA) process [18, 19]. As a result, we proposed a series of 5-modified uracils with DEA characteristics much better than those of BrdU [18, 19], and the superior characteristics of some of them were confirmed in photoelectron spectroscopy [18] and radiolytic studies [20, 21].

Here, it should be emphasised that in order to become a part of the cellular DNA, after translocation from the extracellular environment to the cytosol, a nucleoside has to be phosphorylated to its triphosphate. The first stage of phosphorylation, the conversion of the nucleoside to the nucleoside monophosphate, is usually the rate-limiting step [22]. Hence, a favourable DEA profile [18, 19] is a necessary, but not a sufficient, requirement for an efficient radiosensitising nucleoside. Since chemical synthesis is frequently non-trivial, time-consuming and costly, the value of a computational model that could tell, before the actual synthesis, whether a proposed derivative is a good substrate for a nucleoside kinase is difficult to overestimate. In principle, one can employ a variant of the quantum mechanics/molecular mechanics (QM/MM) approach to model the first, most demanding phosphorylation step of a studied nucleoside by human thymidine kinase 1 (hTK1). However, this is not a simple task. Indeed, the experimental structures of hTK1 are not suitable for reaction mechanism studies since hTK1 changes its quaternary structure upon binding substrates [23]. In addition, the active site amino acids are flexible [24], which makes accurate diffraction measurement of the enzymatic pocket impossible. Out of three available crystal structures [24,25,26], only one monomer in one of those structures includes the full amino acid sequence of the active site. Thus, in order to obtain the active hTK1 geometry, one has to complete the missing amino acid sequence, dock the proper substrates and carry out a long molecular dynamics (MD) simulation that should lead to a ligand-hTK1 complex suitable for the QM/MM calculations. Only a representative structure from the MD simulation, with a proper choice of high-layer atoms and additional constraints, if necessary, would enable the QM/MM calculations to be completed. 
Moreover, several enzymatic mechanisms of phosphate transfer have been proposed so far [27, 28] (dissociative, associative and concerted), and all of them should be modelled to decide which pathway is the most probable. Although the computational procedure described above would give direct insight into the enzymatic process under consideration and would allow nucleosides suitable for phosphorylation to be selected, it is computationally very demanding and may need up to a dozen or so months to complete. Since a significant number of possible derivatives usually have to be scrutinised, a less time-consuming approach is desirable. Therefore, the quantitative structure-activity relationship (QSAR) approach seems to be the method of choice for selecting the most promising nucleoside derivatives, since building a statistical model and choosing appropriate compounds takes only a fraction of the time required by the QM/MM methodology described above. It is worth mentioning that the “intelligent” consensus modelling approach was proposed recently [29]. This approach integrates models developed by means of different combinations of descriptors and/or different modelling methods; in consequence, the prediction is based on multiple individual models (each based on an individual set of descriptors) instead of a single regression equation, which improves the predictivity of the model [29]. On the other hand, because a consensus model is based on a large number of descriptors, its mechanistic interpretation is frequently more difficult than that of a simple QSAR model [30]. In terms of the Organisation for Economic Cooperation and Development principles for QSAR model development and validation [31], the benefits related to the improvement of the model’s statistics are as valuable as the complexity and interpretability of the model.

In the following, we demonstrate the development of a simple correlation model that predicts, with acceptable accuracy, the ease with which various uridine derivatives are phosphorylated by human thymidine kinase 1 (hTK1). The derived correlation equation depends on only three descriptors, which enables the observed activities to be rationalised in terms of the molecular features of the studied nucleosides.

Methodology

Data collection

The experimental data on the hTK1 phosphorylation activity were collected from the available literature [32, 33] and logarithmically transformed to decrease the range of data variation. Experimental data were available for 26 nucleoside analogues. Details of the structures of the analysed nucleosides can be found in Table S1 (Supplementary materials).

Molecular descriptors calculation

Geometries of the nucleoside analogues in the trans-conformation were optimised with the Gaussian09 software [34] at the B3LYP/6-31++G(d,p) density functional theory level, with the polarisable continuum model [35] to account for the aqueous environment. After the optimisation, the molecular descriptors were calculated with the DRAGON software [36]. In total, 957 descriptors were calculated for each nucleoside.

QSAR model development and validation

In the first step, the set of 26 nucleosides was sorted according to increasing values of log A. Then, the data were divided into training (T–1) and validation (V–2) sets using the “3:1 algorithm”, in which every third compound is assigned to the validation set, whereas the remaining ones form the training set. In addition, the second and the penultimate compounds were independently assigned to the validation set.
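The splitting scheme can be illustrated with a short Python sketch. It is only a minimal illustration under stated assumptions: the function name `split_every_kth`, its `step` and `extra` parameters and the toy activity values are hypothetical, and the text does not specify which exact positions were selected, so the sketch is not claimed to reproduce the partition actually used for the model.

```python
def split_every_kth(log_a, step=3, extra=()):
    """Sort compounds by activity and send every `step`-th one to validation.

    log_a : dict mapping compound name -> log A (toy values below, not paper data)
    extra : 0-based positions in the sorted list forced into the validation set
    """
    ranked = sorted(log_a, key=log_a.get)  # increasing log A
    val_idx = set(range(step - 1, len(ranked), step)) | set(extra)
    training = [c for i, c in enumerate(ranked) if i not in val_idx]
    validation = [c for i, c in enumerate(ranked) if i in val_idx]
    return training, validation


# Hypothetical activities for nine placeholder compounds.
toy_values = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
toy = {f"cpd{i}": v for i, v in enumerate(toy_values)}

# `extra=(1,)` forces the second-ranked compound into the validation set.
train, val = split_every_kth(toy, step=3, extra=(1,))
print("training:  ", train)
print("validation:", val)
```

Sorting before splitting spreads the validation compounds over the whole activity range, which is the point of such rank-based schemes.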

The QSAR model was developed using the multiple linear regression (MLR) approach [37, 38]. We assumed that the modelled activity (log A) could be expressed as a linear function of molecular descriptors (x1, x2, x3, …):

$$ \log\ \mathrm{A}={a}_1\ {\boldsymbol{x}}_{\boldsymbol{1}}+{a}_2\ {\boldsymbol{x}}_{\boldsymbol{2}}+{a}_3\ {\boldsymbol{x}}_{\boldsymbol{3}}+\dots +{a}_n\ {\boldsymbol{x}}_{\boldsymbol{n}}+b $$
(1)

where a1, a2, a3, …, an are the regression coefficients and b is the intercept.
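As an illustration of Eq. (1), the regression coefficients and the intercept can be obtained by ordinary least squares. The sketch below uses NumPy rather than the QSARINS software employed in this work; the helper name `fit_mlr` and the synthetic data are assumptions made purely for demonstration.

```python
import numpy as np


def fit_mlr(X, y):
    """Ordinary least-squares fit of y = X @ a + b; returns (a, b)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([X, np.ones(len(y))])  # last column carries b
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[:-1], coef[-1]


# Synthetic, noise-free data (not the paper's descriptors):
# 20 "compounds", 3 "descriptors", known coefficients and intercept.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([-0.5, 2.0, 3.0]) + 0.1

a, b = fit_mlr(X_demo, y_demo)
print("coefficients:", np.round(a, 3), "intercept:", round(float(b), 3))
```

Because the demonstration data are noise-free, the fit recovers the generating coefficients; with real activities the residuals are non-zero and the quality measures of Table 1 become relevant.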

The optimal descriptors were selected in two steps: first, to reduce the descriptor-to-compound ratio, we selected descriptors significantly correlated with log A (r > 0.60); then we applied the genetic algorithm (GA) [39] implemented in the QSARINS software [40]. The purpose of the GA was to find the descriptors that yield the model with the highest validation and cross-validation parameters. The GA setup was as follows: generation per size = 500 and mutation rate = 45%.
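The first, correlation-based filtering step can be sketched as follows. This is a minimal illustration, not the procedure as implemented by the authors: the function name `prefilter_descriptors` is hypothetical, and the sketch assumes that the threshold applies to the absolute value of the Pearson coefficient, which the text does not state explicitly. The subsequent GA step of QSARINS is not reproduced here.

```python
import numpy as np


def prefilter_descriptors(X, y, names, r_min=0.60):
    """Keep descriptors whose |Pearson r| with y exceeds r_min.

    Assumption: the absolute value of r is used, so that negatively
    correlated descriptors are retained as well.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    kept = []
    for j, name in enumerate(names):
        column = X[:, j]
        if np.std(column) == 0:  # constant descriptor: r is undefined
            continue
        r = np.corrcoef(column, y)[0, 1]
        if abs(r) > r_min:
            kept.append(name)
    return kept


# Toy data: d1 and d2 are perfectly (anti)correlated with y,
# d3 is uncorrelated and d4 is constant.
y_toy = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_toy = np.column_stack([
    2.0 * y_toy,
    -y_toy,
    [1.0, -1.0, 1.0, -1.0, 1.0],
    np.ones(5),
])
selected = prefilter_descriptors(X_toy, y_toy, ["d1", "d2", "d3", "d4"])
print(selected)
```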

The model’s fit, robustness and predictive ability were evaluated based on the parameters summarised in Table 1 [41,42,43,44,45,46]. The model’s robustness was also verified by the Y-scrambling procedure. The Williams plot technique was employed to determine the applicability domain of the QSAR model [47,48,49,50]. Additionally, in order to select an optimal model (based on an optimal set of descriptors), the double cross-validation procedure was performed with the double cross-validation tool (version 2.0) developed by Roy and Ambure [51].
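The robustness checks can be illustrated with a minimal NumPy sketch of leave-one-out cross-validation and Y-scrambling. The helper names (`r2`, `q2_loo`, `y_scramble`), the synthetic data and the particular definition of the cross-validated coefficient (one minus PRESS over the total sum of squares about the training mean) are assumptions; the exact definitions applied in this work are those of Table 1 and of the QSARINS software.

```python
import numpy as np


def _fit_predict(X_tr, y_tr, X_te):
    """Fit OLS with intercept on the training part, predict for X_te."""
    A = np.column_stack([X_tr, np.ones(len(X_tr))])
    coef, *_ = np.linalg.lstsq(A, y_tr, rcond=None)
    return np.column_stack([X_te, np.ones(len(X_te))]) @ coef


def r2(X, y):
    """Coefficient of determination of the fitted model."""
    res = y - _fit_predict(X, y, X)
    return 1.0 - res @ res / np.sum((y - y.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model built without it."""
    press = 0.0
    for i in range(len(y)):
        mask = np.arange(len(y)) != i
        pred = _fit_predict(X[mask], y[mask], X[i:i + 1])[0]
        press += (y[i] - pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def y_scramble(X, y, n_rounds=400, seed=0):
    """R2 of models refitted to randomly permuted responses."""
    rng = np.random.default_rng(seed)
    return np.array([r2(X, rng.permutation(y)) for _ in range(n_rounds)])


# Synthetic demonstration data (not the paper's descriptors).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(20, 3))
y_demo = X_demo @ np.array([-0.5, 2.0, 3.0]) + 0.1 + 0.1 * rng.normal(size=20)

r2_real = r2(X_demo, y_demo)
q2_real = q2_loo(X_demo, y_demo)
r2_scrambled = y_scramble(X_demo, y_demo)
print(f"R2 = {r2_real:.3f}, Q2_LOO = {q2_real:.3f}, "
      f"mean scrambled R2 = {r2_scrambled.mean():.3f}")
```

A model that is not the result of a chance correlation should show real statistics well above those of the scrambled models.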

Table 1 Quality measures for QSAR models [41,42,43,44,45,46]

Results and discussion

The multiple linear regression technique was employed to develop the quantitative structure-activity relationship (QSAR) model [37, 38]. This model makes it possible to identify the physicochemical features of nucleoside derivatives governing the hTK1 kinase activity, which should enable new hTK1 substrates to be designed rationally and tested before synthesis.

As mentioned in the “Methodology” section, the experimental data on the hTK1 phosphorylation activity were collected from the available literature [32, 33] and logarithmically transformed to decrease the range of data variation (Table 2).

Table 2 Experimental and predicted values of log A with data split into a training set and a validation set as well as other QSAR model details [32, 33]

The QSAR model is described by Eq. (2):

$$ \begin{array}{l}\log \mathrm{A}=0.073-0.011\ \mathbf{P\_VSA\_LogP\_6}+2.92\ \mathbf{HATS4v}+3.52\ \mathbf{E3u}\\ n=20,\quad n_{\mathrm{val}}=6,\quad F=16.53,\quad p<10^{-4},\quad R^2=0.75,\quad Q_{CV}^2=0.65,\quad Q_{\mathrm{Ext}}^2=0.71,\\ \mathrm{RMSE}_C=0.17,\quad \mathrm{RMSE}_{CV}=0.19,\quad \mathrm{RMSE}_P=0.17,\quad r_m^2=0.55,\quad \mathrm{CCC}=0.86\end{array} $$
(2)

where n and nval stand for the number of compounds in the training and validation set, respectively, while the remaining parameters are defined in Table 1.
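For a new nucleoside, applying Eq. (2) amounts to a single linear combination of its three descriptor values. The sketch below hard-codes the coefficients of Eq. (2); the function name `predict_log_a` and the descriptor values passed to it are hypothetical placeholders, not data for any compound of this study.

```python
# Coefficients and intercept taken from Eq. (2).
INTERCEPT = 0.073
COEFFICIENTS = {"P_VSA_LogP_6": -0.011, "HATS4v": 2.92, "E3u": 3.52}


def predict_log_a(p_vsa_logp_6, hats4v, e3u):
    """Predicted log A from the three descriptors of Eq. (2)."""
    return (INTERCEPT
            + COEFFICIENTS["P_VSA_LogP_6"] * p_vsa_logp_6
            + COEFFICIENTS["HATS4v"] * hats4v
            + COEFFICIENTS["E3u"] * e3u)


# Hypothetical descriptor values, chosen only to exercise the function.
baseline = predict_log_a(p_vsa_logp_6=50.0, hats4v=0.20, e3u=0.50)
more_lipophilic_surface = predict_log_a(p_vsa_logp_6=80.0, hats4v=0.20, e3u=0.50)
print(round(baseline, 3), round(more_lipophilic_surface, 3))
```

Such predictions are meaningful only for compounds that fall within the applicability domain of the model.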

The correlation between the experimentally measured and predicted values of log A for the training (T) and validation (V) sets, presented in Fig. 1, confirms the high quality of the model. All parameters listed below Eq. (2), which were employed to verify the model’s fit, robustness and predictive ability (calculated by means of the quality measures summarised in Table 1 [41,42,43,44,45,46]), meet the required criteria and therefore also support the model’s quality. The Y-scrambling procedure (Fig. 2) additionally confirms that the obtained model is not the result of a chance correlation.
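Some of the statistics quoted with Eq. (2) can be computed with a few lines of NumPy. The sketch below is only an illustration: the function names are hypothetical and the formulas are the commonly used textbook definitions (root-mean-square error over n points, Lin's concordance correlation coefficient, and the F statistic of a regression with p descriptors), which are assumed here and may differ in detail from the definitions of Table 1.

```python
import numpy as np


def rmse(y_obs, y_pred):
    """Root-mean-square error (mean taken over all n points)."""
    y_obs, y_pred = np.asarray(y_obs, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_obs - y_pred) ** 2)))


def ccc(y_obs, y_pred):
    """Lin's concordance correlation coefficient."""
    o, p = np.asarray(y_obs, float), np.asarray(y_pred, float)
    cov = np.mean((o - o.mean()) * (p - p.mean()))
    return float(2.0 * cov / (o.var() + p.var() + (o.mean() - p.mean()) ** 2))


def f_statistic(r2, n, p):
    """F statistic of a regression with p descriptors fitted to n compounds."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))


# Consistency check with the rounded values quoted in Eq. (2):
# R2 = 0.75, n = 20, three descriptors.
f_from_rounded_r2 = f_statistic(0.75, 20, 3)
print(f_from_rounded_r2)  # 16.0, close to the reported F = 16.53
```

The small difference from the reported F value is what one would expect from using R² rounded to two decimal places.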

Fig. 1 Calculated vs. observed values of log A

Fig. 2 Y-scrambling results: average values of the square errors of calibration and cross-validation of the real QSAR model and 400 random models

The applicability domain of the QSAR model was verified based on the plot of the standardised residuals (differences between the predicted and observed values of log A) versus the leverages, the so-called Williams plot (Fig. 3) [47,48,49,50]. The leverage value expresses the similarity of a given compound, for which the prediction is made, to the set of training compounds. Thus, the Williams plot helps to assess the influence of structural similarity between the compounds on the prediction error. All deoxythymidine derivatives used in the training and validation sets lay within ± 3 standard deviations of the mean residual and did not exceed the leverage threshold calculated for this model. Thus, no outlying predictions were observed.
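The leverage calculation behind a Williams plot can be sketched as follows. The function names and the synthetic descriptor matrix are hypothetical. The threshold formula h* = 3(p + 1)/n, with p descriptors and n training compounds, is the commonly used definition and is assumed here rather than quoted from the text; for p = 3 and n = 20 it gives 0.60, which agrees with the threshold shown in Fig. 3.

```python
import numpy as np


def leverages(X_train, X_query):
    """Diagonal hat-matrix elements h_i = x_i (X'X)^-1 x_i' for query compounds."""
    A = np.column_stack([np.ones(len(X_train)), X_train])  # with intercept column
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    core = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)


def leverage_threshold(n_train, n_descriptors):
    """Commonly used warning leverage h* = 3 (p + 1) / n (assumed definition)."""
    return 3.0 * (n_descriptors + 1) / n_train


h_star = leverage_threshold(n_train=20, n_descriptors=3)
print(h_star)  # 0.6

# Synthetic descriptors (not paper data), only to show the call pattern.
rng = np.random.default_rng(2)
X_train_demo = rng.normal(size=(20, 3))
h_train = leverages(X_train_demo, X_train_demo)
print("compounds above h*:", int(np.sum(h_train > h_star)))
```

A compound whose leverage exceeds h* is structurally distant from the training set, so its prediction is an extrapolation and should be treated with caution.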

Fig. 3 Williams plot: standardised residuals versus leverages. Solid lines indicate ± 3 standard deviation units. Dashed lines indicate the threshold value (h* = 0.60)

The QSAR model depicted by Eq. (2) is a linear combination of three descriptors, namely P_VSA_logP_6, HATS4v and E3u, which are not correlated with each other (Fig. S1 in the Supplementary materials). The selection of the optimal model (based on the optimal set of descriptors) was confirmed by the double cross-validation procedure [51] (Table S2 in the Supplementary materials). Conformational changes of a molecule, as well as the total number of bonds and their lengths, influence the values of the HATS4v and E3u descriptors. Thus, both indirectly provide information related to the molecular shape. The P_VSA descriptor, on the other hand, expresses the van der Waals surface area (VSA) occupied by the atoms of a molecule that have a given property within a certain range [52, 53]. Several properties, such as atomic weight and ionisation potential, can be taken into account in P_VSA descriptor calculations. In the case of P_VSA_logP_6, lipophilicity is considered. This descriptor therefore represents the size of the VSA occupied by atoms with lipophilicity within the specified range and thus encodes the availability of molecular fragments for intermolecular hydrophobic interactions [53].

The derived QSAR model (Eq. (2)) indicates that both the impact of atoms on the shape of the nucleoside analogues, coded in the HATS4v and E3u descriptors, and the size of their VSA capable of intermolecular hydrophobic interactions, described by P_VSA_logP_6, are the key properties that influence the hTK1 phosphorylation activity against deoxythymidine derivatives. The sign of the coefficient of the P_VSA_logP_6 descriptor is negative, while those of HATS4v and E3u are positive. This indicates that increasing the value of P_VSA_logP_6 decreases the hTK1 activity, while increasing the values of HATS4v and E3u increases it (compounds with higher values of P_VSA_logP_6 and lower values of HATS4v and E3u are worse substrates for hTK1).

Among the nucleoside analogues employed for the development of the QSAR model, there are derivatives substituted at deoxyribose, namely 3′-OH (e.g. AZMT) and 2′ analogues (e.g. FMAU), as well as molecules modified at the N3 (e.g. IsoT) and 5C positions of pyrimidine (e.g. 5-CldU; Table 2). The highest activities are observed for the compounds substituted with halogens at the 5C position. These derivatives exhibit the highest values of both the HATS4v and E3u descriptors (see Table 2). Modification of the N3 position with an alkyl residue decreases the values of both descriptors and, in consequence, the hTK1 activity. This indicates that substitution at the N3 position with an alkyl group changes the shape of the molecule. It has been shown that substitution at this position hinders the hydrogen bonding between the N3 nitrogen of the pyrimidine base and the main-chain carbonyl of residue 178 in the enzyme, which is required for the tight spacing of the lasso-like loop [32, 33, 54]. Substitution at deoxyribose is also unfavourable: 3′-OH analogues as well as 2′ derivatives exhibit lower activity than dT. The only exception is FIAU; however, this nucleoside is substituted not only at the 2′ position but also at the 5 position of pyrimidine. The lower activity (higher values of P_VSA_logP_6 and lower values of HATS4v and E3u) of 3′-OH derivatives is probably related to the fact that hydrogen bonds cannot be formed in the ligand-enzyme complex between the substituted 3′-OH group and the amino group of Gly182 [32, 33, 54,55,56]. The 2′-fluoro derivative of deoxythymidine differs from dT mostly in the value of P_VSA_logP_6, while the HATS4v and E3u descriptors remain almost the same. This indicates that the shape of FMAU is, in this case, not changed and that the addition of fluorine at the 2′ position affects intermolecular interactions. 
This agrees with the finding that interactions between the 2′-OH group and Tyr187 are hindered when the nucleoside is substituted at the 2′ position [32]. Indeed, the interactions between Tyr187 and the nucleoside 2′-OH group are necessary to keep the lasso in place [32].

Considering that 5C derivatives constitute the most promising group of analogues, we finally focus on this class of nucleosides. The activity of the halogen derivatives follows the pattern 5-CldU > 5-IdU > 5-BrdU > 5-FdU, and the worst substrate is the unsubstituted nucleoside, i.e. dT. A comparison of the descriptor values calculated for all these molecules shows that the HATS4v and E3u descriptors, which code the size and shape of the studied analogues, have the smallest values for dT, equal to 0.160 and 0.437, respectively (see Table 2). The P_VSA_logP_6 values of the 5-halogenated derivatives are similar (except that of 5-FdU), which means that these compounds are similar in terms of intermolecular interactions. The impact of the size of the substituent at position 5 on the phosphorylation reaction was demonstrated previously: a proper size of the substituent at the 5 position is required to fit the binding pocket of the hTK1 enzyme [32]. However, since the activities of the iodo and methyl analogues differ despite the similar size of their 5-substituents, steric hindrance is not sufficient to explain the observed differences in activity. The other important factor is the capacity of the substituents for intermolecular interactions. The polarisability of iodine is much larger than that of the other analysed 5-substituents [57], which might be responsible for the stronger interaction between the nucleoside and the enzyme and, in consequence, makes 5-IdU the best substrate for hTK1 among the analysed compounds.

Conclusion

Due to the limited availability of structural data concerning hTK1, the significant flexibility of its enzymatic pocket, the complex phosphorylation mechanism and the high computational cost of QM/MM simulations, showing with calculations at the atomic level that a given nucleoside is a suitable substrate for hTK1 seems to be more than difficult. Therefore, we have developed a QSAR model that makes it possible to identify the molecular features of thymidine analogues governing their activity against hTK1. Two properties turned out to be the key features responsible for the phosphorylation process: the shape of the nucleoside determined by atom substitution, coded in the HATS4v and E3u descriptors, and the capacity of the molecules for hydrophobic interactions, coded by the P_VSA_logP_6 descriptor. The model meets all requirements related to QSAR model development and can therefore be applied to predict the activity of a new nucleoside before its synthesis. The obtained results also indicate that the most promising analogues are those substituted at the 5C position. When designing a new substrate for hTK1, one should therefore focus on the modification of uridine at this position. A valuable substituent needs to meet two requirements: it has to be at least as large as a methyl group and be able to interact strongly with the enzyme-binding pocket.