Background

Treatment with effective antiretroviral therapy (ART) decreases morbidity and mortality among HIV-positive individuals [1, 2]. Effective ART should lead to an undetectable viral load within 6 months of initiation of therapy [3]. Achievement of early viral suppression (suppression by 24 weeks) predicts long-term treatment success as measured by virological suppression, CD4+ cell count increase and reduction in mortality [4, 5]. However, in sub-Saharan Africa, more than 24% of patients receiving first-line ART experience virological failure within 1 year of initiation of therapy [6, 7]. Furthermore, treatment failure and subsequent switching from first-line to second-line ART have been reported to occur as early as 6 and 7 months, respectively, after ART initiation in resource-limited settings [8, 9]. Failure to achieve and maintain viral suppression may lead to the development of drug resistance and increase the risk of viral transmission [6, 10, 11].

Attainment of early virological suppression depends on a number of factors, including the choice of initial ART regimen (especially in ART-naïve patients), ART adherence, comorbidities, inter-individual variability in drug pharmacokinetics, demographic and genetic factors, drug resistance, and baseline viral load and CD4 count [12,13,14,15,16,17,18,19]. Leveraging some or all of these factors in a rapid risk calculation to predict early viral outcomes for individual patients before initiation of ART would enhance clinical decision making and help avert the adverse outcomes of treatment failure and the costs associated with switching to second-line ART [20].

Machine learning models have been developed and used to predict virological response to ART. However, the use of such models to guide therapeutic decision making may be limited for two major reasons. Many of these models rely heavily on virological resistance genotype data, which may not be available in resource-limited settings [21,22,23]. Those that avoid genotype data use relatively complex classifiers such as random forests (RF) or artificial neural networks (ANN) as the backbone of online prediction tools [20, 24,25,26,27]. Such tools and methods are not easily interpretable by medical providers and are inaccessible in resource-limited settings where computing facilities may not be available. Logistic regression is popular among medical practitioners owing to its interpretability and ease of application without the need for a computer. Therefore, logistic regression-based machine learning may address the above-mentioned limitations of the available virological response prediction tools. The purpose of this study was to assess the performance of three logistic regression-based machine learning methods at predicting early virological failure in HIV patients initiating ART.

Methods

Patient cohorts

Data from two independent cohorts was used in this analysis. The Infectious Diseases Institute (IDI) cohort data was used for training the prediction model and testing its generalizability while data from the efavirenz (EFV) cohort was used to test the model’s ability to predict outside the studied population (transportability).

The IDI cohort data was obtained from the integrated clinic enterprise application (ICEA) database implemented and maintained at IDI [28]. The database is regularly validated for quality, completeness and discrepancies. The data consists of 559 consecutive HIV patients enrolled between April 2004 and April 2005. Upon recruitment, patients were initiated on one of three ART regimens, namely stavudine/lamivudine/nevirapine (30/300/200 mg or 40/300/200 mg) or efavirenz/zidovudine/lamivudine (600/150/300 mg). Patients were followed up every 6 months, although intermediate visits occurred for some patients. Patient information was collected at all visits and included demographic data, previous and current opportunistic infections, non-HIV related clinical events, WHO stage, vital signs, ART regimen, physical examination results, adherence to ART, ART toxicity, reasons for ART substitution, complete blood count, liver and renal function tests, CD4 count, HIV viral load, death and the cause of death. The cohort is still under observation and details of the cohort study procedures have been reported before [29,30,31]. The observational study was approved by the institutional review board and the Uganda National Council of Science and Technology (UNCST).

The EFV cohort data came from a cohort recruited for an efavirenz dose optimization study [32]. The data consists of 262 ART-naïve HIV/AIDS patients treated for HIV with standard-dose efavirenz/zidovudine/lamivudine (600/150/300 mg). These patients were recruited from Mulago National Referral Hospital, Kampala (n = 155), Butabika Hospital, Kampala (n = 60) and Bwera Hospital, Kasese (n = 47) in 2008 and 2009. One hundred and fifty-eight of these patients were TB co-infected at the time of ART initiation. Only 235 patients in this dataset had viral load counts collected in the first 6 months of ART. Baseline demographic characteristics (age, weight, sex, TB disease status) as well as CD4 count and viral loads were collected for these patients. Follow-up visits occurred on days 3, 56, 84, 112, 140, 148 and 168. Each participant provided at least two viral load measurements. The study was approved by the institutional review boards and UNCST. Details of the data have been published before [32].

The machine learning algorithms

Three logistic regression-based modelling approaches were used to model the longitudinal data: simple logistic regression (SLR), multitask temporal logistic regression (MTLR) and patient-specific survival prediction (PSSP) modelling.

Simple logistic regression

In this approach, all data was aggregated together as if the outcome occurred at a single time point (6 months). The outcome variable was set to 1 if viral suppression was achieved at 6 months of ART, or 0 if it was not. Baseline predictors were used to predict the outcome using logistic regression. If \( {y}_i \) and \( {x}_i \) are the observation and its corresponding vector of predictors, respectively, such that \( {y}_i\in \left\{0,1\right\} \), and \( \theta \) is a vector of coefficients, the probability of virological suppression is given by

$$ {h}_i=P\left({y}_i=1|{x}_i,\theta \right)=\frac{1}{1+{e}^{-{\theta}^T{x}_i}} $$
(1)

L2 regularization was applied to the model to reduce overfitting. This was accomplished by minimizing the following cost function.

$$ -\sum \limits_{i=1}^N\left({y}_i\log \left({h}_i\right)+\left(1-{y}_i\right)\log \left(1-{h}_i\right)\right)+{\lambda}_1{\left\Vert \theta \right\Vert}_2^2 $$
(2)

The hyperparameter λ1 controls overfitting and N is the total number of individuals in the training dataset.
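For readers who prefer code to notation, the penalized cost of Eqs. (1)–(2) can be written compactly. The following is a minimal numpy sketch with hypothetical variable names and toy data, not the code used in the study (which was implemented in R); it assumes the design matrix already contains an intercept column and, following Eq. (2), penalizes all coefficients.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def slr_cost(theta, X, y, lam1):
    """Negative log-likelihood of Eq. (1) plus the L2 penalty of Eq. (2).

    X : (N, p) matrix of baseline predictors (first column of ones = intercept)
    y : (N,) vector of 0/1 suppression outcomes at 6 months of ART
    """
    h = sigmoid(X @ theta)                          # Eq. (1)
    eps = 1e-12                                     # guard against log(0)
    nll = -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
    return nll + lam1 * np.sum(theta ** 2)          # Eq. (2)

# toy data with an intercept and 3 hypothetical baseline predictors
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = rng.integers(0, 2, size=100)
print(slr_cost(np.zeros(4), X, y, lam1=1.0))
```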

Multitask temporal logistic regression (MTLR)

Each clinic visit day was treated as a separate learning task for which a logistic regression classification model was trained (fitted) and the task-specific parameters (coefficients) and probability of virological suppression were learned (estimated). Thus, for any task \( t\in \left\{1,2,\dots, M\right\} \), if \( {y}_{i,t} \) and \( {x}_{i,t} \) are the observation and its corresponding feature vector, respectively, such that \( {y}_{i,t}\in \left\{0,1\right\} \), and \( {\theta}_t \) is a vector of task-specific coefficients, the probability of virological suppression is given by

$$ {h}_{i,t}=P\left({y}_{i,t}=1|{x}_{i,t},{\theta}_t\right)=\frac{1}{1+{e}^{-{\theta}_t^T{x}_{i,t}}} $$
(3)

For each task, overfitting was reduced by explicitly controlling the complexity of the model using L2 regularization as described below. Additionally, the similarity between tasks was leveraged, without concealing their uniqueness, by applying a multitask learning approach. Specifically, all tasks were learned jointly such that the temporal relation between tasks was enforced. This was accomplished by minimizing the following cost function.

$$ \sum \limits_{t=1}^M\left[-\sum \limits_{i=1}^{N_t}\left({y}_{i,t}\log \left({h}_{i,t}\right)+\left(1-{y}_{i,t}\right)\log \left(1-{h}_{i,t}\right)\right)+{\lambda}_1{\left\Vert {\theta}_t\right\Vert}_2^2\right]+{\lambda}_2\sum \limits_{t=1}^{M-1}{\left\Vert {\theta}_{t+1}-{\theta}_t\right\Vert}_2^2 $$
(4)

The first term is the negative log-likelihood of the observed outcomes across all tasks, the second term limits the generalization error via L2 regularization and the third term enforces temporal smoothness on the weights of adjacent tasks. The hyperparameters λ1 and λ2 control overfitting and temporal smoothness, respectively.
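A corresponding sketch of the multitask cost in Eqs. (3)–(4) is shown below. It is illustrative only: the per-task design matrices and outcome vectors are hypothetical stand-ins for the visit-day datasets described above, not the study code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mtlr_cost(Theta, X_tasks, y_tasks, lam1, lam2):
    """Penalized multitask cost of Eq. (4).

    Theta   : (M, p) array, one coefficient vector per task (clinic visit day)
    X_tasks : list of M arrays; X_tasks[t] is the (N_t, p) design matrix of task t
    y_tasks : list of M arrays of 0/1 suppression outcomes for task t
    """
    eps = 1e-12
    cost = 0.0
    for t, (X, y) in enumerate(zip(X_tasks, y_tasks)):
        h = sigmoid(X @ Theta[t])                              # Eq. (3)
        cost += -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
        cost += lam1 * np.sum(Theta[t] ** 2)                   # per-task L2 penalty
    # temporal smoothness between coefficient vectors of adjacent visit days
    cost += lam2 * np.sum((Theta[1:] - Theta[:-1]) ** 2)
    return cost
```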

Patient specific survival prediction modeling (PSSP)

In this approach, we formulated the problem as a survival analysis problem for each patient, using the method developed and described by Yu et al. [33]. The aim was to predict whether suppression occurs within 168 days and, for each patient, the time at which it occurs. The dataset was restructured to include only the four most commonly shared observation times, namely t = {0, 84, 98, 168}, also referred to as tasks, t = {1,..,M}, where M = 4. Patient outcomes \( {y}_t\in \left\{0,1\right\} \) were recorded at each time point for each patient, capturing the dependence between observations. Thus, if S is the time point at which an undetectable viral load is first recorded for the nth patient, then \( {y}_t=0 \) for all t < S and \( {y}_t=1 \) for all t ≥ S. The elements of the sequence y = (y1, y2,…,yM) of outcomes over all four time points were encoded as \( {y}_{t,n}^{(s)} \) for the value at time t, where s is the survival time in the sequence. For our four time points, there are five possible sequences, including a sequence of all zeros. The logistic regression method was extended to model the probability of observing the survival-status sequence for the nth patient as follows:

$$ {h}_n=P\left(Y=\left({y}_1,\dots, {y}_M\right)|{x}_n,\Theta \right)=\frac{e^{\sum_{j=1}^M{y}_j\left({\theta}_j^T{x}_n\right)}}{\sum_{k=0}^M{e}^{f\left({x}_n,k,\Theta \right)}} $$
(5)

where \( \Theta \) is the set of all parameter vectors \( \left({\theta}_1,\dots, {\theta}_M\right) \) and \( f\left(x,k,\Theta \right)=\sum \limits_{i=k+1}^M\left({\theta}_i^Tx\right) \) for \( 0\le k\le M \); for k < M this term corresponds to the sequence in which the viral load becomes undetectable (y = 1) in the interval [tk, tk + 1], while the empty sum at k = M corresponds to no suppression within the observation window. In order to predict patient-specific survival probabilities and times, we minimize the following cost function:

$$ -\sum \limits_{n=1}^N\left[\sum \limits_{t=1}^M{y}_{t,n}^{(s)}\left({\theta}_t^T{x}_n\right)-\log \sum \limits_{k=0}^M{e}^{f\left({x}_n,k,\Theta \right)}\right]+{\lambda}_1\sum \limits_{t=1}^M{\left\Vert {\theta}_t\right\Vert}_2^2+{\lambda}_2\sum \limits_{t=1}^{M-1}{\left\Vert {\theta}_{t+1}-{\theta}_t\right\Vert}_2^2 $$
(6)

The first term is the negative log-likelihood of the observed sequences given the parameters \( \Theta =\left({\theta}_1,\dots, {\theta}_M\right) \) and the baseline predictor variables \( {x}_n \), for all N patients. The second term is the L2 regularizer that prevents overfitting, and the third term is a regularizer that enforces temporal smoothness on parameters from adjacent observation time points. The hyperparameters λ1 and λ2 control overfitting and temporal smoothness, respectively.
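The sequence likelihood of Eqs. (5)–(6) is less standard, so a sketch may help. The code below is a hypothetical numpy implementation of the penalized negative log-likelihood, assuming the monotone 0/1 survival-status sequences are supplied as rows of a matrix; it is not the study code.

```python
import numpy as np

def pssp_cost(Theta, X, seq, lam1, lam2):
    """Penalized negative log-likelihood of Eqs. (5)-(6).

    Theta : (M, p) array of time-point-specific coefficient vectors
    X     : (N, p) array of baseline predictors
    seq   : (N, M) array of monotone 0/1 survival-status sequences y_{t,n}
    """
    n = X.shape[0]
    scores = X @ Theta.T                                   # entry [i, t] = theta_t^T x_i
    # f(x, k, Theta) = sum_{j=k+1}^{M} theta_j^T x for k = 0..M (empty sum = 0 at k = M)
    suffix = np.concatenate(
        [np.cumsum(scores[:, ::-1], axis=1)[:, ::-1], np.zeros((n, 1))], axis=1
    )
    log_norm = np.logaddexp.reduce(suffix, axis=1)         # log of the denominator of Eq. (5)
    loglik = np.sum(seq * scores) - np.sum(log_norm)
    return (-loglik
            + lam1 * np.sum(Theta ** 2)
            + lam2 * np.sum((Theta[1:] - Theta[:-1]) ** 2))
```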

Data preparation and model building

Data preparation

The outcome of interest was viral suppression. It was coded in each row (corresponding to one observation) as 1 if the viral load count was below 400 copies/ml and 0 otherwise. The choice of viral load cut-off was based on the lower limit of quantification of the assay (400 copies/ml) at the time of recruitment. The EFV cohort viral load measurements had a lower limit of quantification of 40 copies/ml; however, for this analysis a cut-off of 400 copies/ml was applied because it encompasses both datasets. The proportion of undetectable viral load observations was 0.47 in the IDI cohort dataset and 0.69 in the EFV cohort dataset.

The observation time (clinic visit) was recorded as days after initiation of ART, corresponding to follow-up visits. Patient data up to day 180 (corresponding to 6 calendar months) in the IDI cohort and up to day 168 in the EFV cohort was used in this analysis, because early virological suppression is expected to have occurred by this time if treatment and patient management are effective.

Predictor variables

The predictor variables (features) in the data were sex, baseline age and body weight, TB disease status, ART regimen, and baseline CD4 count and viral load (VL). Sex was coded as 0 for female and 1 for male participants. TB disease status was coded as 0 if the participant had been diagnosed with TB at the start of ART and 1 otherwise. ART regimen was coded 1 to 3 according to the regimen a patient was initiated on. Age, body weight, CD4 count and viral load were treated as continuous variables. All of these features have previously been reported in the literature to be related to virological outcomes [15, 34]. Model training and testing utilized the IDI cohort data.

Data splitting

The IDI cohort dataset was randomly split into training and test sets in the ratio 2:1 based on individual ID numbers. The training dataset consisted of 322 individuals (765 labelled examples) while the test dataset consisted of 162 individuals (380 labelled examples). The training dataset was used to train the models and learn the feature coefficients (weights). The test dataset was used to assess the performance of the models at predicting outcomes in a previously unseen dataset from a reasonably related population, also known as model generalizability testing. Care was taken in the choice of the splitting ratio to ensure that the training examples were sufficient and that the test dataset had a minimum of 100 positive and 100 negative outcomes [35, 36].
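A sketch of such an ID-based split is shown below (the column name patient_id and the seed are hypothetical); splitting by individual rather than by row ensures that repeated visits from one patient never appear in both sets.

```python
import numpy as np
import pandas as pd

def split_by_id(df, id_col="patient_id", train_frac=2 / 3, seed=1):
    """Randomly assign whole patients (not rows) to training or test sets
    in roughly a 2:1 ratio."""
    ids = df[id_col].unique()
    rng = np.random.default_rng(seed)
    rng.shuffle(ids)
    n_train = int(round(train_frac * len(ids)))
    train_ids = set(ids[:n_train])
    train = df[df[id_col].isin(train_ids)]
    test = df[~df[id_col].isin(train_ids)]
    return train, test
```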

Hyperparameters optimization

An exhaustive search for the optimal L2-regularization and temporal smoothing parameters (λ1 and λ2), from a set of 302 pre-specified candidates ranging from 0 to 1000, was carried out using the grid search method. The combination of λs that maximized the model's predictive performance on the training dataset was selected as follows. For each candidate hyperparameter combination, a 5-fold cross-validation was carried out on the training dataset: the training dataset was randomly split into five equal parts; four parts (80% of the data) were used to learn the model coefficients and the fifth part (20%) was used to compute the area under the receiver operating characteristic curve (AUROC). The procedure was repeated until each of the five parts had been used for testing, and the mean AUROC over the five runs was computed. The hyperparameter combination with the highest mean AUROC was selected and used for model training.
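The grid search can be sketched as follows. The fit and predict functions are placeholders for whichever of the three models is being tuned, and the random split mirrors the 5-fold procedure described above; this is an illustrative outline, not the study code.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score

def grid_search(candidates, fit_fn, predict_fn, X, y, seed=1):
    """For each candidate (lambda1, lambda2) pair, run 5-fold cross-validation
    on the training data and return the pair with the highest mean AUROC.

    fit_fn(X, y, lam1, lam2) -> fitted coefficients
    predict_fn(coef, X)      -> predicted probabilities of suppression
    """
    cv = KFold(n_splits=5, shuffle=True, random_state=seed)
    best_pair, best_auc = None, -np.inf
    for lam1, lam2 in candidates:
        aucs = []
        for train_idx, val_idx in cv.split(X):
            coef = fit_fn(X[train_idx], y[train_idx], lam1, lam2)  # learn on 4 folds
            p = predict_fn(coef, X[val_idx])                       # score the held-out fold
            aucs.append(roc_auc_score(y[val_idx], p))
        if np.mean(aucs) > best_auc:
            best_pair, best_auc = (lam1, lam2), np.mean(aucs)
    return best_pair, best_auc
```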

Cost function optimization

The cost functions were minimized using the BFGS (for MTLR and SLR) and Nelder-Mead (for PSSP) algorithms as implemented in the optim function in R [37,38,39]. At least 10 retries with different sets of starting parameters were used to ensure convergence and stability of the final coefficient estimates. Bootstrap analysis with 1000 bootstrap replicates was used to obtain the bootstrap mean, median and 95% confidence intervals of the parameter estimates from the training dataset [40].
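In Python, the same multi-start optimization and bootstrap could be sketched with scipy as below; the helper names, the starting-value scale and row-level (rather than patient-level) resampling are assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def fit_with_restarts(cost_fn, n_params, args=(), method="BFGS", n_restarts=10, seed=1):
    """Minimize a cost function from several random starting points and keep
    the best solution, as a simple check of convergence and stability."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        x0 = rng.normal(scale=0.1, size=n_params)
        res = minimize(cost_fn, x0, args=args, method=method)
        if best is None or res.fun < best.fun:
            best = res
    return best.x

def bootstrap_coefficients(fit_fn, X, y, n_boot=1000, seed=1):
    """Nonparametric bootstrap of the coefficients: refit on resampled rows and
    summarise with the mean, median and a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample observations with replacement
        draws.append(fit_fn(X[idx], y[idx]))
    draws = np.asarray(draws)
    return {"mean": draws.mean(axis=0),
            "median": np.median(draws, axis=0),
            "ci95": np.percentile(draws, [2.5, 97.5], axis=0)}
```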

Model validation

Goodness-of-fit (reliability) plots depicting the agreement between the observed proportion of viral suppression and the predicted probability of virological suppression were generated for each model [41]. In these plots the range of predicted probabilities was discretized into 20 intervals, and the mean predicted probability and the associated observed proportion of viral suppression in each interval were calculated and plotted. The points should lie close to the diagonal if the model is well calibrated; otherwise the model is misspecified [42,43,44]. A corresponding sharpness diagram was plotted to show the distribution of predictions over the probability categories used to generate the reliability plot. The root mean squared error (RMSE) with respect to the identity line was also calculated.
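The binning behind the reliability plot and its RMSE can be reproduced with a few lines of numpy; the sketch below uses 20 equal-width probability intervals as described above, with the bin counts serving as the sharpness diagram.

```python
import numpy as np

def reliability_points(y_true, p_pred, n_bins=20):
    """Per probability bin: mean predicted probability, observed proportion of
    suppression and the number of predictions (for the sharpness diagram),
    plus the RMSE of the points about the identity line."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_idx = np.clip(np.digitize(p_pred, edges) - 1, 0, n_bins - 1)
    mean_pred, obs_prop, counts = [], [], []
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():                      # skip empty bins
            mean_pred.append(p_pred[mask].mean())
            obs_prop.append(y_true[mask].mean())
            counts.append(int(mask.sum()))
    mean_pred, obs_prop = np.array(mean_pred), np.array(obs_prop)
    rmse = np.sqrt(np.mean((obs_prop - mean_pred) ** 2))
    return mean_pred, obs_prop, np.array(counts), rmse
```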

The calculated probabilities were used to assess the overall predictive performance of the models by calculating the mean squared error (MSE), also known as the Brier score [45]. Since the outcome prevalence in the test datasets used for the MTLR and PSSP models was 0.47, a Brier score of less than 0.245 was considered satisfactory predictive performance. For the SLR model, the outcome prevalence in the dataset was 0.115, so a Brier score of less than 0.102 was considered satisfactory [41].

The models' discriminative ability was assessed by generating a receiver operating characteristic curve and the corresponding c-statistic (AUROC), and a precision-recall curve and the corresponding area under the precision-recall curve (AUPRC), using the non-parametric method [46]. The c-statistic is a measure of the ability of the model to correctly classify those with and without the outcome; values of 0.5–0.7, 0.7–0.79, 0.8–0.89 and > 0.9 were considered poor, moderate, good and excellent, respectively [47]. An AUPRC value above 0.47 was considered satisfactory. The F1 score, the harmonic mean of precision and recall, was also calculated; the closer the F1 score is to 1 the higher the discriminative ability of the model, while values close to 0 indicate poor discrimination [48].
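These overall and discriminative performance measures (Brier score, AUROC, AUPRC and F1) are all available in scikit-learn; the sketch below uses average precision as a non-parametric stand-in for the AUPRC and a placeholder threshold of 0.5 for the F1 score (the study's actual cut-point is the Youden index described next).

```python
import numpy as np
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             f1_score, roc_auc_score)

def performance_summary(y_true, p_pred, threshold=0.5):
    """Overall and discriminative performance of one model's predicted probabilities."""
    y_hat = (np.asarray(p_pred) >= threshold).astype(int)
    return {
        "brier": brier_score_loss(y_true, p_pred),          # mean squared error of probabilities
        "c_statistic": roc_auc_score(y_true, p_pred),       # AUROC
        "auprc": average_precision_score(y_true, p_pred),   # area under the precision-recall curve
        "f1": f1_score(y_true, y_hat),
    }
```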

The Youden index (J-statistic) of each model was obtained by searching among plausible values of the predicted probability of the outcome for the value at which the sum of sensitivity and specificity was maximal [49]. For any task, if a patient's predicted probability was above this value, viral suppression was predicted to occur; this probability was therefore taken as the decision boundary (cut-point) between low- and high-probability patients [50, 51].
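A sketch of the cut-point search follows; it scans the thresholds returned by the ROC curve, which is equivalent to searching the plausible predicted probabilities for the maximum of sensitivity plus specificity.

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutpoint(y_true, p_pred):
    """Return the predicted-probability cut-point maximizing the Youden
    J statistic (sensitivity + specificity - 1) and the J value itself."""
    fpr, tpr, thresholds = roc_curve(y_true, p_pred)
    j = tpr - fpr                              # sensitivity + specificity - 1
    best = int(np.argmax(j[1:])) + 1           # skip the degenerate first ROC point
    return thresholds[best], j[best]
```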

Using this cut-point, the performance of the models outside the studied population, setting and period, also known as temporo-spatial transportability, was assessed on the EFV cohort dataset to ensure the practical applicability of the models [52]. The tasks shared between the IDI and EFV datasets were days 1, 84, 112, 140 and 168, and model transportability was tested only on these tasks.

The models were used to predict the probability of suppression in the EFV dataset, and the prediction accuracy, sensitivity, specificity, negative predictive value and positive predictive value were generated for each model.
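At a fixed cut-point, these quantities follow directly from the confusion matrix; a minimal sketch (assuming 0/1 outcomes and numpy arrays) is given below.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def classification_summary(y_true, p_pred, cutpoint):
    """Accuracy, sensitivity, specificity, PPV and NPV at a fixed probability cut-point."""
    y_hat = (np.asarray(p_pred) >= cutpoint).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }
```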

Results

The distributions of variables in the two cohorts were similar, as shown in Table 1.

Table 1 Distribution of variables in the EFV and IDI cohort datasets

The MTLR model adequately fit the data, implying good model calibration. The PSSP and SLR models showed poor fit to the training set, implying misspecification and poor reliability (see Fig. 1). The RMSEs with respect to the identity line were 0.053, 0.100 and 0.143 for the MTLR, PSSP and SLR models respectively.

Fig. 1

Reliability plots showing model calibration. The lower plot is the sharpness diagram showing the distribution of probability categories used to generate the reliability plot

The MTLR and PSSP models showed adequate overall predictive performance, with Brier scores of less than 0.245, as shown in Table 2. The SLR model did not show adequate predictive performance, since its Brier score was higher than the maximum acceptable value of 0.102 (the outcome prevalence in its dataset was 0.115).

Table 2 The 5-fold cross-validated model discriminative characteristics and general predictive performance

The MTLR, PSSP and SLR models showed excellent, moderate and poor discriminative ability, respectively, with respect to AUROC, as shown in Table 2 and depicted in Fig. 2.

Fig. 2

Receiver operating characteristic (ROC) and precision-recall (PRC) curves showing the models' ability to discriminate outcomes in the IDI cohort

Both the MTLR and PSSP models predicted viral suppression in the EFV cohort with adequate accuracy and discrimination, as shown in Table 3 and Fig. 3. The SLR model's discriminative performance on the EFV cohort was worse than random guessing.

Table 3 Discrimination and prediction accuracy of viral suppression in the EFV cohort by all models
Fig. 3

A heatmap showing changes in feature importance with time after initiation of ART in the MTLR and PSSP models

Figure 3 shows the variation of feature importance over time in the MTLR model, which was the best performing model. A change in the appearance of a column relative to the first column indicates a change in the relative importance of that feature over time.

Discussion

In this study, three modelling methods were developed to predict early virological outcomes in patients initiating ART, using their demographic and clinical data, and their performance was compared. The machine learning approach to the development and validation of these models was chosen to maximize prediction accuracy and generalizability, since only a limited number of variables were used [53, 54]. Logistic regression-based methods were employed because of the popularity of logistic regression among medical practitioners, owing to the interpretability of its parameters and its ease of application. Prediction on external data was improved by penalizing the model coefficients using L2 regularization. L2 regularization was chosen over other regularization methods so as to retain all the selected features in the models while penalizing their weights according to their contribution to the overall predictive performance of the model. Nevertheless, the resultant coefficients cannot be used to infer associations because they are biased to maximize prediction [54].

The multitask learning approach employed in the MTLR and PSSP models captures the relatedness of the outcomes at the different follow-up visits while retaining the peculiarities of each outcome [55]. The PSSP model combines the logistic regression models at each task in a temporally dependent manner to form a survival function capable of predicting patient-specific survival. The MTLR model, on the other hand, does not enforce any dependency between the logistic regression models at each task and is therefore a task-specific classifier. In both models, temporal smoothness was enforced by regularization, which reduced overfitting for the under-sampled tasks, improved the prediction accuracy of all tasks and led to better overall generalizability than that of the SLR model. The better predictive performance of the MTLR model compared with the PSSP model could imply an absence of dependency between tasks in the data; in other words, virologic status at any time point does not depend on, and cannot be inferred from, that at another time point.

The multitask models were able to capture the temporal structure of the outcomes in the data, thus enabling the study of the temporal dynamics of the features as depicted in Fig. 3 [56, 57]. The normalized weights of these variables exhibited temporal variation, implying that the relative importance of the variables changes over time. This might also explain why the SLR model, which allows only a single weight per variable without accounting for temporal variation in feature importance, did not fit the data as well as the multitask approaches.

Whereas the MTLR model was the best performing model in all aspects, including prediction in the external EFV dataset, the predictive performance of the PSSP model was higher in the external EFV dataset than in the IDI dataset. It was not immediately clear why this was the case. The MTLR model was chosen over the PSSP and SLR models based on the goodness-of-fit plots. Adequate goodness of fit implies model reliability, while poor goodness of fit may imply model misspecification, which might affect the reproducibility of the model's predictive performance [42,43,44]. Based on the goodness-of-fit plots, the MTLR model is likely to be more reliable than the other two, and it was thus chosen as the final model for the subsequent analyses.

The final model was used to develop a risk score that stratifies patients into low and high risk of early virological failure, using the Youden index as the cut-off point. The score had good prediction accuracy (92.9%) and satisfactory discriminatory performance (87.8%) in an external dataset from another cohort, different in geographical location and year of recruitment from the one used to train and validate the model. This implies that the model is applicable across geographical boundaries and is temporally consistent. The cut-off point was then varied to maximize specificity so as to limit the number of false positives, because false positive misclassification means that some patients with virological failure would be missed, and we wanted to avoid this. However, at maximum specificity the percentage of false positives (~ 7%) was similar to that at the selected cut-off point, albeit with worse prediction accuracy and an increase in false negatives. False negative classification means that some patients with viral suppression are misclassified as having virological failure; this can be costly in terms of confirmatory virological testing, clinical monitoring and the choice of alternative ART regimens, which could strain the health system. With the current cut-off point no false negatives were observed in either the test or the external dataset, so we retained this cut-off point for practical application.

The model has other predictive and practical advantages in resource-limited settings. In these settings, the absence of routine viral load monitoring, pharmacogenetic testing and drug resistance mutation testing to guide the choice of therapy poses a great challenge [58, 59]. In addition, health care system challenges affect accurate diagnosis, patient monitoring and the provision of care, making it difficult to identify patients at risk of early virological failure [60]. This model could therefore guide individualized clinical decisions such as the choice of first-line ART and clinical (virological and immunological) monitoring. The risk score can readily be calculated by hand.

Conclusion

Three logistic regression-based models were developed to predict early virological suppression using seven baseline demographic and clinical variables. The multitask temporal logistic regression (MTLR) model outperformed the other models in all aspects, demonstrating adequate calibration and excellent classification and overall predictive performance. The multitask models outperformed the simple logistic regression model. Logistic regression-based models are capable of accurately predicting early virological suppression using readily available baseline demographic and clinical variables.