Introduction

Acquired immune deficiency syndrome or acquired immunodeficiency syndrome (AIDS) is a disease of the human immune system caused by the human immunodeficiency virus (HIV). This condition would progressively reduce the effectiveness of the immune system and leaves individuals susceptible to opportunistic infections and tumors (Jabs, 2011; Chitra et al., 2011; Ganguli et al., 2012; Holland et al., 2010; Wachira and Ruger, 2011).

Acquired immunodeficiency syndrome is now a pandemic, and it has been the sixth leading cause of death among people aged 25–44 in the United States since 1995. The World Health Organization estimated that more than 25 million people worldwide have died from this infection since the start of the epidemic (Kallings, 2008). In 2009, AVERT reported that there were 33.3 million people worldwide living with HIV/AIDS, with 2.6 million new HIV infections per year and 1.8 million annual deaths due to AIDS. In 2007, UNAIDS estimated that 33.2 million people worldwide had AIDS that year, AIDS killed 2.1 million people in the course of that year, including 330,000 children, and moreover 76 % of those deaths occurred in sub-Saharan Africa. According to UNAIDS 2009 report, we have had 60 million infected people, 25 million deaths, and 14 million orphaned children in southern Africa since the epidemic began (Nagata et al., 2011; Furin et al., 2012).

Human immunodeficiency virus (HIV) causes AIDS. The virus attacks the immune system and leaves the body vulnerable to a variety of life-threatening infections and cancers. Common bacteria, yeast, parasites, and viruses which do not ordinarily cause serious diseases in people with healthy immune systems can cause fatal illnesses in people with AIDS.

HIV has been found in saliva, tears, nervous system tissue and spinal fluid, blood, semen (including pre-seminal fluid, which is the liquid that comes out before ejaculation), vaginal fluid, and breast milk. However, only blood, semen, vaginal secretions, and breast milk generally transmits infection to others (Schmidt, 2011). The virus can be spread (transmitted) by sexual contact (including oral, vaginal, and anal sex), blood [via blood transfusions (now extremely rare in the U.S.) or needle sharing], exchange between mother and baby during pregnancy, childbirth, breastfeeding, or other exposures to one of the above bodily fluids; other methods of spreading the virus are rare and include accidental needle injury, artificial insemination with infected donated semen, and organ transplantation with infected organs. AIDS is not transmitted to a person who donates blood or organs. However, HIV can be transmitted to a person receiving blood or organs from an infected donor. To reduce this risk, blood banks and organ donor programs screen donors, blood, and tissues thoroughly (Johnston et al., 2010; Firląg-Burkacka et al., 2009).

Although treatments for AIDS and HIV can slow the course of the disease, there is no known cure or vaccine. Antiretroviral treatment reduces both the mortality and the morbidity of HIV infection, but these drugs are expensive, and routine access to antiretroviral medication is not available in all countries (Guo and Li, 2011; Fomsgaard et al., 2011). Due to the difficulty in treating HIV infection, preventing infection is a key aim in controlling the AIDS pandemic, with health organizations promoting safe sex and needle-exchange programs in attempts to slow the spread of the virus. HIV is transmitted through direct contact of a mucous membrane or the bloodstream with a bodily fluid containing HIV, such as blood, semen, vaginal fluid, preseminal fluid, and breast milk (Self, 2010).

Acquired immunodeficiency syndrome begins with HIV infection. People infected with HIV may have no symptoms for 10 years or longer, but they can still transmit the infection to others during this symptom-free period. If the infection is not detected and treated, the immune system gradually weakens and AIDS develops. People with AIDS also have an increased risk of developing various cancers such as Kaposi’s sarcoma, cervical cancer, and cancers of the immune system known as lymphomas. In addition, people with AIDS often have systemic symptoms of infection like fevers, sweats (particularly at night), swollen glands, chills, weakness, and weight loss (Holmes et al., 2003). The specific opportunistic infections that AIDS patients develop depend, in part, on the prevalence of these infections in the geographic area in which the patient lives. The initial infection with HIV may produce no symptoms: some people, however, do experience flu-like symptoms with fever, rash, sore throat, and swollen lymph nodes, usually 2–4 weeks after contracting the virus. Some people with HIV infection stay symptom-free for years between the time they are exposed to the virus and when they develop AIDS (Lyons et al., 2011).

An anti-HIV agent can exert its biological activity in different stages of the viral life cycle inhibiting them. Studies were limited to those stages and phenomenon that appear during viral replication: viral binding to the target cell, viral fusion with the host cell by viral penetration into the host cell’s membrane, viral uncovering in the host cell, reverse genomic RNA transcription, integration of the new viral DNA into the host cell’s chromosomes, provirus activation producing mRNA, viral detachment from the host cell, and viral maturation.

Reverse transcription of viral genomic RNA into double strained DNA by the RT enzyme is essential for HIV replication. Thus, the inhibition of this essential phase of HIV life cycle provides the most attractive target in order to develop a compound with biological anti-HIV potential. For example, most drugs approved by the FDA for HIV infection treatment are RT inhibitors. High resolution electronic microscopy shows that HIV-1 is a 100 nm virus with a capsule. The external layer is a double lipidic layer derived from the host cell during maturation and contains two major viral glycoproteins (gp): the transmembranar gp41 and outside gp120. There is a protein associated to the membrane (p 18) which provides the matrix for the viral structure and is essential for the integrity of the virus. The matrix surrounds a dense cylindrical characteristic nucleoid which contains the p24 protein from the capside. Inside the nucleoid, there are two identical RNA strains; the viral RNA dependent DNA-polymerase (p66/p55) called reverse-transcriptase (RT) is related to p9 nucleoprotein, to p12 integrase protein, and to components of p15 protease, see Fig. 1 (Ganguli et al., 2012; Wachira and Ruger, 2011; Holmes et al., 2003; Lyon et al., 2011).

Fig. 1
figure 1

a The human immunodeficiency virus (HIV) Anatomy b Life cycle of HIV

By these means, HEPT (1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine) derivatives can be regarded as non-nucleosidic reverse transcriptase inhibitors (NNRTI), see Figs. 2 and 3, and are analogs of the natural substrate. HEPT derivatives don’t interact with the binding site of the DNA or RNA-dependent DNA polymerase. Because of this it is expected that these ligands would not determine side effects. HEPT ligands interact uncompetitively with an allosteric site of the enzyme and don’t affect the substrate binding in a direct way. Actually, NNRTI have a higher binding affinity to the ligand–enzyme complex than to the free enzyme. The HEPT ligand–enzyme interaction leads to enzymatic conformational variations; in other words, the enzyme’s active site has a decreased affinity to the natural substrate. This property is valid only regarding the HIV-1 RT; HEPT ligands are inactive against HIV-2 or other retroviruses. The NNRTI exclusive specificity for the HIV-1 RT is attributed to the presence—at the level of this enzyme (and not in the case of other RT or DNA polymerases)—of a flexible extreme hydrophobic pocket in which HEPT derivatives (different from natural substrate analogs) fit and can be bound (Ji et al., 2007; Wang et al., 2009; Bajaj et al., 2005).

Fig. 2
figure 2

The reference structure of HEPT derivatives

Fig. 3
figure 3

Typical examples of HEPT (1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine) derivatives

The term ‘half maximal effective concentration’ (EC50) refers to the concentration of a drug, antibody, or toxicant, which induces a response between the baseline and maximum after some specified exposure time. It is commonly used as a measure of a drug’s potency. The EC50 of a graded dose–response curve represents the concentration of a compound where 50 % of its maximal effect is observed. The EC50 of a quantal dose–response curve represents the concentration of a compound where 50 % of the population exhibits a response, after specified exposure duration (Luis et al., 2010).

Various partial drugs which have been created would treat the HIV infection at various stages but no drug has been found yet to cure. Because of this, we need to comprehend the chemicals and mathematical models that could be applied as an extrapolation model to study the desired features of an anti-HIV drug. The best mathematical model that can quantitatively relate the anti-HIV activity with the structural descriptors is the QSAR model (Quantitative Structure Activity Relationship). The QSAR analysis has been done for various groups of compounds and also for diverse sets of anti-HIV compounds (Goodarzi and Freitas, 2010; Bharate and Singh, 2011; Goodarzi et al., 2009; Si et al., 2008).

There is a trend to develop QSAR from a variety of methods. In particular, genetic algorithm (GA) is frequently used as search algorithm for variable selections in chemometrics and QSAR (Yanmaz et al., 2011). Moreover, nonlinear statistical treatment of QSAR data is expected to provide models with better predictive quality as compared with linear models. In this perspective, artificial neural network (ANN) modeling has become quite common in the QSAR field (Afantitis et al., 2011; Zuperl et al., 2011). Extensive use of ANN, which does not require the “a priori” knowledge of the mathematical form of the relationship between the variables, largely rests on its flexibility (functions of any complexity can be approximated). In recent years, nonlinear kernel-based algorithm as kernel partial least squares (KPLS) has been proposed (Postma et al., 2011). KPLS can efficiently compute latent variables in the feature space by means of nonlinear kernel functions. Compared to other nonlinear PLS methods, the main advantage of the kernel-based algorithm is that it does not involve nonlinear optimization; thus it essentially requires only linear algebra which makes it as simple as the conventional linear PLS. In addition, because of its ability to use different kernel functions, KPLS can handle a wide range of nonlinearities (Cao et al., 2011). In the present study, GA-KPLS and L–M ANN were employed to generate QSAR models that correlate the structure of HEPT ligands and inhibitors of HIV reverse-transcriptase (RT), with anti-HIV biological activity log (1/EC50).

Computational

Data set

The anti-HIV biological activity log (1/EC50) of 79 HEPT derivatives which were taken from the literature (Duda-Seiman et al., 2004) has been presented in Table 1. In this table are given the group of substituents considered on the general structure of Figs. 2 and 3. Biological activities are given as log (1/EC50) where EC50 represents the concentration and also produces a 50 % protection of MT-4 cells against the direct toxic HIV-1 effect.

Table 1 The data set, structure, and the corresponding observed log (1/EC50) values

Computer hardware and software

All calculations were run on a HP laptop computer with an AMD Turion64X2 processor and a Windows XP operating system. The optimizations of molecular structures were done by HyperChem 7.0 and descriptors were calculated by Dragon Version 3.0 software. Cross validation, GA-KPLS, L–M ANN and other calculations were performed in the MATLAB (Version 7, Mathworks, Inc.) environment.

Molecular modeling and theoretical molecular descriptors

The derivation of theoretical molecular descriptors proceeds from the chemical structure of the compounds. In order to calculate the theoretical descriptors, molecular structures were constructed with the aid of HyperChem version 7.0. The final geometries were obtained with the semi-empirical AM1 method in HyperChem program. The molecular structures were optimized using Fletcher–Reeves algorithm until the root mean square gradient was 0.01 kcal mol−1. The resulting geometry was transferred into Dragon program in order to calculate 1,497 descriptors, which was developed by Todeschini et al., (2003).

Genetic algorithm for descriptor selection

To select the most relevant descriptors with GA, the evolution of the population was simulated (Noorizadeh and Noorizadeh, 2012; Van Dijck and Van Hulle, 2011; Cséfalvayová et al., 2010). Each individual of the population, defined by a chromosome of binary values, represented a subset of descriptors. The number of the genes at each chromosome was equal to the number of the descriptors. The population of the first generation was selected randomly. A gene was given the value of one if its corresponding descriptor was included in the subset; otherwise, it was given the value of zero. The number of the genes with the value of one was kept relatively low to have a small subset of descriptors (Hao et al., 2011); in other words, the probability of generating zero for a gene was set greater. The operators used here were crossover and mutation. The application probability of these operators was varied linearly with a generation renewal. For a typical run, the evolution of the generation was stopped, when 90 % of the generations had taken the same fitness. In this paper, size of the population is 30 chromosomes, the probability of initial variable selection is 5:V (V is the number of independent variables), crossover is multi Point, the probability of crossover is 0.5, mutation is multi Point, the probability of mutation is 0.01, and the number of evolution generations is 1,000. For each set of data, 3,000 runs were performed.

Nonlinear model

Artificial neural network

An artificial neural network (ANN) with a layered structure is a mathematical system that stimulates biological neural network consisting of computing units named neurons and connections between neurons named synapses (Noorizadeh and Farmany, 2012; Garkani-Nejad and Ahmadi-Roudi, 2010; Singh et al., 2010). All feed-forward ANN used in this paper are three-layer networks. Each neuron in any layer is fully connected with the neurons of a succeeding layer. Figure 4 shows an example of the architecture of such ANN. The Levenberg–Marquardt back propagation algorithm was used for ANN training and the linear functions were used as the transformation functions in hidden and output layers.

Fig. 4
figure 4

Used three layer ANN

Results and discussion

Nonlinear models

Results of the GA-KPLS model

The leave-group-out cross validation (LGO-CV) has been performed. In this research, a radial basis kernel function, \( k(x,y) = \exp \left( {{{\left| {\left| {x - y} \right|} \right|^{2} } \mathord{\left/ {\vphantom {{\left| {\left| {x - y} \right|} \right|^{2} } c}} \right. \kern-0pt} c}} \right) \), was selected as the kernel function with \( c = rm\sigma^{2} \) where r is constant that can be determined by considering the process to be predicted (here r set to be 1), m is the dimension of the input space, and \( \sigma^{2} \) is the variance of the data (Kim et al., 2005). It means that the value of c depends on the system under the study. The 14 descriptors in five latent variables—space chosen by GA-KPLS feature selection methods—were contained constitutional descriptors (number of Oxygen atoms (nO) and number of non-H bonds (nBO)), topological descriptors (centralization (CENT)), geometric descriptors (gravitational index G2 (bond-restricted) (G2), 3D Petitjean shape index (PJI3), and Qxx COMMA2 value/weighted by atomic van der Waals volumes (QXXv)), 3D-MoRSE descriptors (3D-MoRSE—signal 09/weighted by atomic masses (Mor09m)), WHIM descriptors (first component accessibility directional WHIM index/weighted by atomic polarizabilities (E1p) and A total size index/weighted by atomic electrotopological states (As)), atom-centered fragments (number of terminal primary C(sp3) (nCp), CH3R/CH4 (C-001) and phenol/enol/carboxyl OH (O-057)) and charge descriptors (relative positive charge (RPCG) and submolecular polarity parameter (SPP)). The R 2 and RE for training and test sets were (0.861, 0.748) and (14.37, 23.09), respectively. For the constructed model, two general statistical parameters were selected to evaluate the prediction ability of the model for the log (1/EC50). The predicted values of log (1/EC50) are plotted against the experimental values for training and test sets in Fig. 5. Consequently, as a result, the number of components (latent variables) is less than the number of independent variables in KPLS analysis. The statistical parameters highest square correlation coefficient leave-group-out cross validation (R 2) and relative error (RE) were obtained for proposed models. Each of the statistical parameters mentioned above was used for assessing the statistical significance of the QSAR model. This GA-KPLS approach currently constitutes the most accurate method for predicting the anti-HIV biological activity of the drug compounds. The KPLS model uses higher number of descriptors that allows the model to extract better structural information from descriptors to result in a lower prediction error. This suggests that GA-KPLS holds promise for applications in choosing variables for L–M ANN systems. This result indicates that the log (1/EC50) of these drugs possesses some nonlinear characteristics.

Fig. 5
figure 5

Plots of predicted log (1/EC50) against the experimental values by GA-KPLS model

Results of the L–M ANN model

With the aim of improving the predictive performance of nonlinear QSAR model, L–M ANN modeling was performed. The networks were generated using the 14 descriptors appearing in the GA-KPLS models as their inputs and log (1/EC50) as their output. For ANN generation, data set was separated into three groups: calibration, prediction, and test sets. A three-layer network with a sigmoid transfer function was designed for each ANN. Before training the networks, the input and output values were normalized between −1 and 1. Then, the network was trained using the training set and the back propagation strategy for optimizing the weights and bias values. The proper number of nodes in the hidden layer was determined by training the network with different number of nodes in the hidden layer. The root-mean-square error (RMSE) value measures how good the outputs are in comparison with the target values. It should be noted that for evaluating the over fitting, the training of the network for the prediction of log (1/EC50) must stop when the RMSE of the prediction set begins to increase while RMSE of calibration set continues to decrease. Therefore, training the network was stopped when overtraining began. All of the above mentioned steps were carried out using basic back propagation, conjugate gradient, and Levenberge–Marquardt weight update functions. Accordingly, one can realize that the RMSE for the training and test sets are minimum when five neurons were selected in the hidden layer. Finally, the number of iterations was optimized with the optimum values for the variables. The R2 and RE for calibration, prediction, and test sets were (0.916, 0.894, 0.868) and (9.98, 11.34, 15.29), respectively. The experimental, calculated, relative error and RMSE values log (1/EC50) by L–M ANN are shown in Table 2. Inspection of the results reveals a higher R 2 and lowers other values parameter for the training, test, and prediction sets compared with their counterparts for GA-KPLS. Plots of predicted log (1/EC50) versus experimental log (1/EC50) values by L–M ANN for calibration, prediction, and test sets are shown in Fig. 6a, b. Obviously, there is a close agreement between the experimental and predicted log (1/EC50), and the data represent a very low scattering around a straight line with respective slope and intercept close to one and zero. This clearly shows the strength of L–M ANN as a nonlinear feature selection method. The key strength of L–M ANN is their ability to allow for flexible mapping of the selected features by manipulating their functional dependence implicitly. The residuals (predicted log (1/EC50) − experimental log (1/EC50)) obtained by the L–M ANN modeling versus the experimental log (1/EC50) values are shown in Fig. 7a, b. As the calculated residuals are distributed on both sides of the zero line, one may conclude that there is no systematic error in the development of the neural network. The whole of these data clearly displays a significant improvement of the QSAR model consequent to nonlinear statistical treatment.

Table 2 Experimental, calculated, relative error, and RMSE values log (1/EC50) by L–M ANN model
Fig. 6
figure 6

Plot of predicted log (1/EC50) obtained by L–M ANN against the experimental values a calibration and prediction set of molecules and b for test set

Fig. 7
figure 7

Plot of residuals obtained by L–M ANN against the experimental log (1/EC50) values a training set of molecules and b for test set

Model validation and statistical parameters

The applied internal (leave-group-out cross validation (LGO-CV)) and external (test set) validation methods were used for the predictive power of models. In the leave-group-out procedure, one compound was removed from the data set, the model was trained with the remaining compounds and used to predict the discarded compound. The process was repeated for each compound in the data set. The predictive power of the models developed on the selected training set is estimated on the predicted values of test set chemicals. The data set should be divided into three new sub-data sets, one for calibration and prediction (training), and the other one for testing. The calibration set was used for model generation. The prediction set was applied to deal with overfitting of the network, whereas test set, the molecules of which have no role in model building was used for the evaluation of the predictive ability of the models for external set.

On the other hand by means of training set, the best model is found and then, its prediction power is checked by test set, as an external data set. In this study, from all 79 components, 48 components are in calibration set, 16 components are in prediction set, and 15 components are in test set).

The result clearly displays a significant improvement of the QSAR model consequent to nonlinear statistical treatment and a substantial independence of model prediction from the structure of the test molecule. In the above analysis, the descriptive power of a given model has been measured by its ability to predict partition of unknown drugs.

For the constructed models, some general statistical parameters were selected to evaluate the predictive ability of the models for log (1/EC50) values. In this case, the predicted log (1/EC50) of each sample in prediction step was compared with the experimental acidity constant. The first statistical parameter was relative error (RE) that shows the predictive ability of each component, and is calculated as:

$$ {\text{RE}}\;(\% ) = 100 \times \left[ {\frac{1}{n}\sum\limits_{i = 1}^{n} {\frac{{(y_{i}^{ \wedge } - y_{i} )}}{{y_{i} }}} } \right] $$
(1)

The predictive ability was evaluated by the square of the correlation coefficient (R 2) which is based on the prediction error sum of squares and was calculated by the following equation:

$$ R^{2} = \frac{{\sum\limits_{i = 1}^{n} {(y_{i}^{ \wedge } - \bar{y})} }}{{\sum\limits_{i = 1}^{n} {(y_{i} - \bar{y})} }} $$
(2)

where y i is the experimental log (1/EC50) in the sample i, \( y_{i}^{ \wedge } \) represented the predicted log (1/EC50) in the sample i, \( \bar{y} \) is the mean of experimental log (1/EC50) in the prediction set and n is the total number of samples used in the test set.

The main aim of the present study was to assess the performances of GA-KPLS and L–M ANN for modeling the anti-HIV biological activity of drugs. The procedures of modeling including descriptor generation, splitting of the data, variable selection, and validation were the same as those performed for modeling of the log (1/EC50) of HEPT ligands and RT drugs.

Conclusion

In the current research, two nonlinear methods (GA-KPLS and L–M ANN) were used to construct a quantitative relation between the anti-HIV biological activity of HEPT ligands and RT drugs and their calculated descriptors. The results obtained by L–M ANN were compared with the results obtained by GA-KPLS model. The results demonstrated that L–M ANN was more powerful in the log (1/EC50) prediction of the drug compounds than GA-KPLS. A suitable model with high statistical quality and low prediction errors was eventually derived. This model could accurately predict the anti-HIV biological activity of these components that did not exist in the modeling procedure. It was easy to notice that there was a good prospect for the L–M ANN application in the QSAR modeling.