Lasso regularization for leftcensored Gaussian outcome and highdimensional predictors
 200 Downloads
Abstract
Background
Biological assays for the quantification of markers may suffer from a lack of sensitivity and thus from an analytical detection limit. This is the case of human immunodeficiency virus (HIV) viral load. Below this threshold the exact value is unknown and values are consequently leftcensored. Statistical methods have been proposed to deal with leftcensoring but few are adapted in the context of highdimensional data.
Methods
We propose to reverse the BuckleyJames least squares algorithm to handle leftcensored data enhanced with a Lasso regularization to accommodate highdimensional predictors. We present a Lassoregularized BuckleyJames least squares method with both nonparametric imputation using KaplanMeier and parametric imputation based on the Gaussian distribution, which is typically assumed for HIV viral load data after logarithmic transformation. Crossvalidation for parametertuning is based on an appropriate loss function that takes into account the different contributions of censored and uncensored observations. We specify how these techniques can be easily implemented using available R packages. The Lassoregularized BuckleyJames least square method was compared to simple imputation strategies to predict the response to antiretroviral therapy measured by HIV viral load according to the HIV genotypic mutations. We used a dataset composed of several clinical trials and cohorts from the Forum for Collaborative HIV Research (HIV Med. 2008;7:2740). The proposed methods were also assessed on simulated data mimicking the observed data.
Results
Approaches accounting for leftcensoring outperformed simple imputation methods in a highdimensional setting. The Gaussian BuckleyJames method with crossvalidation based on the appropriate loss function showed the lowest prediction error on simulated data and, using real data, the most valid results according to the current literature on HIV mutations.
Conclusions
The proposed approach deals with highdimensional predictors and leftcensored outcomes and has shown its interest for predicting HIV viral load according to HIV mutations.
Keywords
Limit of detection BuckleyJames least squares procedure HIV viral load Drug resistance HIV genotypic mutations Crosssectional studiesBackground
Leftcensoring due to the lower detection limit of an assay is a common problem in many fields including biology, chemistry, and the environmental sciences. One example is the quantification of the human immunodeficiency virus (HIV) viral load in plasma. The sensitivity of assays has improved and the detection threshold has decreased from 10,000 copies/mL to 20 or fewer copies/mL today. Several statistical methods have been proposed to account for leftcensoring of such quantitative variables in crosssectional (with one measure per subject) and longitudinal (with several measures per subject) studies. Standard methods include multiple imputation [1, 2, 3, 4], reverse survival analysis methods [2, 5, 6, 7], quantile regression [8, 9] and censored quantile regression [10, 11]. Furthermore, the Tobit model with censored outcome which is supposed to be normally distributed can be estimated by maximum likelihood [12, 13, 14, 15, 16, 17, 18] or by the BuckleyJames estimator [18, 19]. Indeed, HIV viral load appears to have an underlying Gaussian distribution truncated by the detection limit that justifies the normality hypothesis [13, 14, 15, 17, 18]. As expected, approaches accounting for leftcensoring outperform simple imputation of a constant [2, 4, 13, 14, 15, 16, 18, 20, 21, 22].
Another issue may arise when the number of predictors (p) is high compared to the number of statistical units (n), without excluding the possibility that n<p. This is known as high dimensionality. In the context of HIV infection, this can be illustrated by analyzing the association between the presence of HIV mutations and the response to antiretroviral therapy which is measured by HIV viral load. HIV strains circulating in a given individual can present mutations associated with antiretroviral treatment failure (detectable HIV viral load), also called HIV drug resistance mutations. Thus, genotypic tests allowing the detection of HIV drug resistance mutations are commonly performed in patients starting a new antiretroviral regimen or even in newly HIVinfected patients because of the transmission of resistant strains [23, 24, 25, 26]. Lasso linear [27, 28] and logistic regressions [29], principal component and partial least square logistic regressions [30], and multiple testing correction [31] have been used to deal with more than 100 predictors and fewer than a few hundred of patients, a common situation in this context [32].
These studies use a dichotomized outcome or simple imputation by a constant to circumvent the problem of censoring. One limitation of dichotomizing a continuous outcome is the loss of information and hence power. In addition, success is usually defined as achieving an undetectable HIV viral load. However, the detection limit, although not random, depends on several factors that differ from one study to another. Thus, there is no reason, except convenience, for the detection limit to correspond to the threshold for dichotomization.
We hypothesize that approaches accounting for leftcensoring will exhibit better results compared to simple imputation strategies in a highdimensional setting similar to what has been found in lowdimensional settings.
Some works have simultaneously addressed both censoring and highdimensional problems using the Lasso [33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], partial least squares [44], random forests [45], support vector machines [46], and deep learning [47]. These examples were developed for rightcensored survival data. A main approach to leftcensored data analysis is based on methods typically used with rightcensored survival data such as the BuckleyJames estimator. Leftcensored data are then previously reversed to rightcensored data. While from a statistical point of view, the nature of the outcome (timetoevent or quantitative measurement below a limit of detection) is secondary, this can impact the choice of adequate probability distribution functions and other practical issues.
We propose a Lassoregularized BuckleyJames least squares method with both, nonparametric imputation using KaplanMeier and parametric imputation based on the Gaussian distribution. The nonparametric Buckley–James estimator, which simply replaces censored residuals by their conditional expectations in an iterative way, has been previously applied to leftcensored HIV viral load data in a crosssectional study [18]. On the other hand, the Lasso extension of the nonparametric Buckley–James method has been proposed for rightcensored data [36, 38, 40, 48]. Our contribution consists in using the latter method for leftcensored outcomes and highdimensional predictors. Furthermore, we propose an original parametric version of the BuckleyJames method, which is adapted to the typical assumption of a Gaussian distribution of HIV viral load. We demonstrate the value of these approaches by comparing them to Lasso linear regression with simple imputation [28] for predicting the response to antiretroviral therapy by HIV genotypic mutations.
Our primary objective is to predict as accurately as possible responses in future patients who will switch to a similar regimen. Thus, comparisons are based on mean square prediction error. The prediction performances of the different methods were assessed on simulated data that reproduced the observed data. Then, methods were applied to data obtained in a collaborative study from clinical trials and cohorts provided by the Standardization in Clinical Relevance of HIV Drug Resistance Testing Project from the Forum for Collaborative HIV Research [49]. The actual data presented a moderate censoring rate of 26 %, i.e. a realistic magnitude [18, 50]. However, high (around 50 %) or even severe (around 70 %) censoring rates could be observed in older studies with a high limit of detection (LOD) or particular populations with low treatment failure rate, e.g. HIV controllers [18, 51, 52]. Thus, we also explored the impact of high and severe censoring rates on performance.
We detail how to use publicly available R packages to compute Lasso estimates with leftcensored data.
Finally, we discuss possible extensions and applications of our work.
Methods
Methods to analyze leftcensored outcome
In this section, we review the simplest models and estimation methods used to deal with leftcensoring in crosssectional studies. For a more extensive and comprehensive review of these methods, see [2, 18]. Thereafter, we consider the Lasso extension of those methods that support simple implementations.
The linear model
where X_{i} is a pvector of fixed predictors, Y_{i} is the uncensored continuous random outcome variable, β=(β_{1},⋯,β_{p})^{⊤} is a pvector of unknown regression parameters and ε_{i} are independent and identically normally distributed random variables with mean 0 and constant variance σ^{2}. Let X be the n×p matrix X=(X_{1},…,X_{n})^{⊤} and Y the n×1 vector Y=(Y_{1},…,Y_{n})^{⊤}. The intercept is omitted in the model for simplicity, and all predictor variables are assumed to be standardized (i.e. zero mean and unit variance).
Lasso on complete data
where \(\left \ \mathbf {Y}  \mathbf {X} \boldsymbol {\beta } \right \^{2}_{2} = {\sum \nolimits }_{i=1}^{n} \left (Y_{i}  \mathbf {X}_{i} \boldsymbol {\beta } \right)^{2}\) is the quadratic loss, \(\left \ \boldsymbol {\beta } \right \_{1} = {\sum \nolimits }_{j=1}^{p} \left  \beta _{j} \right \) is the Lasso penalty on the parameter size, and λ>0 controls the amount of regularization. When λ is large enough (which depends on data), all coefficients are forced to be exactly zero. Inversely, λ=0 corresponds to the unpenalized ordinary leastsquares estimate.
This model on complete data (no leftcensored measures) is considered as a reference when comparing the other methods applied to incomplete datasets that include leftcensored values.
The Tobit model
where \(\delta _{i} = \mathbb {I}_{\left (Y_{i} > \text {LOD}\right)}\) is a censoring indicator. The idea behind the Tobit regression model is to deal with the leftcensored variable Z as the outcome of a normally distributed latent variable Y.
Simple imputation
Simple imputation is widely used for its simplicity. However, replacing any censored observation by a single value may lead to biased parameter estimates.
Maximum likelihood estimation
with \(f_{G}\left (u, \mathbf {X}_{i}, \boldsymbol {\beta }, \sigma ^{2}\right) = \frac {e^{\frac {\left (u  \mathbf {X}_{i} \boldsymbol {\beta }\right)^{2}}{2 \sigma ^{2}}}}{\sqrt {2 \pi \sigma ^{2}}}\) the Gaussian probability density function of Y_{i} with mean X_{i}β and constant variance σ^{2} evaluated at u and \(F_{G}\left (v, \mathbf {X}_{i}, \boldsymbol {\beta }, \sigma ^{2}\right) = \int _{ \infty }^{v}f_{G}\left (u, \mathbf {X}_{i}, \boldsymbol {\beta }, \sigma ^{2}\right) \, \mathrm {d}u \) is the corresponding Gaussian cumulative distribution function evaluated at v. Let \(\hat {\boldsymbol {\beta }}_{MLE}\) be the maximum likelihood estimation obtained by maximizing (7). Extensions to other distributions have also been explored [54]. Several works have shown the superiority of this method [18, 20, 21, 22]. However, when the parametric model is misspecified, the sample size is small or the percent censoring is high, the maximumlikelihood estimation method has been shown to perform poorly [2].
The Lasso penalty applied to some likelihood function has become an established and relatively standard technique. However, when the likelihood function is a more complex function of the model parameter, such as the likelihood function for the Tobit model (7), adding a nondifferentiable penalty leads to a computational challenging optimization.
Quantile regression and censored quantile regression
Lassoregularized least absolute deviations regression has been investigated in the literature (e.g. [55]).
The Lasso extension of censored quantile regression (basically, censored LAD) has been analyzed for rightcensored survival data [41, 42, 56] and specifically for leftcensored data [57, 58, 59, 60, 61, 62]. Yu et al. [58] and Alhamzawi et al. [61] proposed Bayesian approaches using different hyperparameters priors. These methods rely on computationally intensive algorithms. In practice, applications are limited to the n>>p case. Others [57, 59, 60, 62] derived theoretical properties of the Lassoregularized censored least absolute deviations regression, but the algorithmic development was not a priority in these works and the practical use was limited to n>>p or not addressed. To our knowledge, there are no publicly available software tools that implement the Lasso extension of Powell’s approach and no simple implementation relying on existing packages seems straightforward.
Nonparametric BuckleyJames
where F(u,X_{i},β) is now the (unknown) cumulative distribution function of M−Y_{i} with mean M−X_{i}β evaluated at u, which can be estimated, for example, by KaplanMeier. The BuckleyJames estimate, \(\hat {\boldsymbol {\beta }}_{NonParBJ}\), can be computed using a semiparametric iterative algorithm that alternates between imputation of censored values according to (9) and leastsquares estimation.
The main drawback of this method is that convergence of the algorithm is not guaranteed. Due to the discontinuous nature of the estimating function (formulation (8) makes \(\hat {\boldsymbol {\beta }}_{NonParBJ}\) to be a piecewise linear function in β), the iterative procedure may oscillate between different parameter values. The problem is of practical importance in situations where the effect of predictors is small or in small samples [63] (which could be worse in highdimensional settings). To circumvent this problem, a onestep algorithm that stops at the first iteration is used in some works [36, 64]. This approach is close to a substitution method in which values below the detection limit are replaced by expected values of the missing measurements, provided they are less than the detection limit [65].
Gaussian BuckleyJames
Graphical illustration in a lowdimensional setting
To illustrate the difference between estimation methods we generated data from the simple linear model (p=1): Y_{i}=X_{i}β+ε_{i} with i=1,⋯,n, X∼N(0,1), ε∼N(0,σ^{2}). β was set to 10 and σ^{2} was chosen such that the signaltonoise ratio was 4 :3. A limit of detection was then fixed to obtain the desired censoring rate: moderate, 20 %, high, 50 % or severe, 70 %.
Notice that simple imputation by LOD and LOD/2 are the most distant regression lines from the true and gold standard lines, in an opposite way: simple imputation by LOD tends to overestimate the response values while simple imputation by LOD/2 tends to underestimate them.
Maximum likelihood estimation shows one of the best behaviors, but the computational complexity dramatically increases with p (results not showed). Mean and median regressions with simple imputation by LOD are quite close and are the closest when the estimation situation is easy (high n, low censoring rate). The censored LAD shows better results than censored mean regression (MLE and BuckleyJames) for small sample size while the inverse is observed when n=100. Gaussian BuckleyJames and MLE are identical, but their differences increase when p increases (results not showed). In this i.i.d. generated from a Gaussian distribution example, the Gaussian BuckleyJames estimate shows better behavior than nonparametric BuckleyJames, the difference being higher when n is small.
Tuning parameter selection
CV is evaluated on a grid of λ–values. The highest value, λ_{max}, corresponds to the smallest value of λ for which all coefficients are zero. The lowest value, λ_{min}, corresponds to the unpenalized solution (when feasible). We choose the λ value that minimizes the CV function.
where n_{k} is the sample size of D_{k}. However, Y is a latent variable not fully observed due to the detection limit. This loss function could be used only for the gold standard in (2), with simulated data. Again, the simplest imputation strategy consists in replacing Y with Z, in D_{k}. Alternatively, BuckleyJames strategies could replace censored Y_{i} values in the test data D_{k} by their conditional expectation estimated using the learning data [48].
where \(n_{k}^{\text {unc}}\) is the number of uncensored observations in D_{k}. The loss function L_{G} in (15) is proportional and equivalent to the negative Gaussian loglikelihood loss function, but allows for comparison with the squared loss in (14).
Implementation issues
All statistical analyses, comparisons and implementations were performed using the computing environment R (R Development Core Team, 2017) [69]. We used the function cv.glmnet from package glmnet [70] to choose the optimal λ value of Lasso linear regression on complete data (GoldS) and Lasso linear regression with simple substitution of leftcensored values by the detection limit (LOD). We implemented the Lasso nonparametric BuckleyJames (NonParBJ) using the bujar package [71]. We modified the function to support stratified Kfold crossvalidation and conserve the same proportion of censoring in all the folds. Lasso Gaussian BuckleyJames (Gaussian BJ) was implemented in a new function cvGaussBJ. Algorithm 1 specifies how to solve the problem. The stopping criterion is based on the difference between current and previous regression coefficient estimates, variance estimates, and imputed data. Because of the tendency to oscillate between different parameter values of the iterative procedure, the algorithm is also stopped if the number of oscillations is high [40]. Alternatively, we also considered the onestep algorithm, which stops at the first iteration [36, 64]. Crossvalidation based on both imputation and loss function accounting for censored and uncensored contributions is considered. The Lasso estimation step depends on package glmnet.
All these implementations and an artificial example are available at: https://github.com/psBiostat/leftcensoredLasso.
Prediction of HIV viral load from HIV genotypic mutations: real and simulated data
HIV is highly replicative and thus presents high mutation and recombination rates which could lead to the development of HIV drug resistance and consequently reduce the efficacy of antiretroviral treatment. To optimize the control of the evolution of HIV drug resistance, HIV viral load is routinely monitored to identify treatment failure, and HIV genotypic tests are commonly performed before a switch to a new treatment regimen in patients already treated or at the initiation of the first treatment in naive HIVinfected patients [72].
Our objective is to compare methods that handle leftcensoring by conditional imputation with methods that handle leftcensoring by imputing a single constant value (that is, the Lassoregularized linear regression with simple imputation by LOD and LOD /2) to predict of HIV viral load by HIV genotypic mutations. The methods accounting for leftcensoring by imputing the estimated conditional mean given censoring and predictor values are the Lassoregularized BuckleyJames least square algorithms (with/without Gaussian assumption, with complete convergence/1step, using crossvalidation based on imputation/loss function accounting for censored and uncensored contributions).
Real data
The database used in this study was provided by the Standardization and Clinical Relevance of HIV Drug Resistance Testing Project for the Forum for Collaborative HIV Research [49]. Patients included in this study were all treatmentexperienced and switched to an abacavircontaining regimen. The investigated drug, abacavir, is a nucleoside reverse transcriptase inhibitor (NRTI) that blocks HIV reverse transcriptase.
The sample size n=99 was slightly smaller than the number of predictors p=121. 54 of the 121 predictors correspond to the presence or absence of specific mutations in the reverse transcriptase gene (RTG), which were reported to be probably associated with resistance to abacavir, multiNRTI, NRTI (other than abacavir) or nonnucleoside reverse transcriptase inhibitors (NNRTI) at the time of the study [73, 74]. The number of mutations reported to be probably associated with resistance to abacavir or multiNRTI is low (14%). The other 67 predictors correspond to the presence or absence of specific mutations in the protease gene (PG) reported to be probably associated with resistance to one or several protease inhibitors (PI) at the time of the study [73, 74]. The number of molecules, including abacavir, ranged from 1 to 6 (with the median number of molecules being 3 and interquartile range 2). In particular, a PI was prescribed in 59% of the patients and 43% received an NNRTI. The response variable is the logHIV viral load measured at t_{8} (8 weeks after treatment initiation at t_{0}). LOD was fixed at 100 copies/mL and the censoring rate was moderate (26%).
Generation of simulated data

\(Y_{i}^{(0)}\), the HIV viral load at t_{0} is generated by a normal distribution with mean 12 (log10 copies/mL) and variance 1.

\(\beta _{1}^{(0)}\) represents the change of the slope between the HIV viral load on the day of treatment, t_{0}, and 8 weeks later, at t_{8}, when no mutations are present and for 1 log_{10}/mL higher concentration of viral load at t_{0}.
We fix β_{0} and \(\beta _{1}^{(0)}\) to obtain the desired censoring rates: 20% (moderate), 50% (high), and 70% (severe).

X_{(n×p)}, representing the presence or absence of HIV mutations, is generated by a multinomial distribution with mean 0.15 (the fixed prevalence for all the 100 mutations) and covariance matrix Σ where Σ_{ij}=0.4^{i−j} (the closer the mutations, the more positively they are correlated).

β=(β_{1},⋯,β_{p})^{⊤}. Among p=100 candidate mutations only 10% are relevant with effects β_{j}=1, if j=1,…,10 and 0 if j>10. A 1unit increase in HIV viral load is expected per occurrence of these relevant mutations for a given baseline HIV viral load.

ε is generated from a normal distribution with mean 0 and variance σ^{2} chosen such that the signaltonoise ratio is fixed at 3:1.
Results
Simulation results
The imputation by LOD /2 led to poorer results than the imputation by LOD. Thus, for the simple imputation, only LOD imputation (LOD) results are shown. For the Gaussian BuckleyJames algorithms (Gaussian BJ), the error is calculated by using both crossvalidation with imputation and crossvalidation with the loss function indicated in (15), but only the best results are shown.
The Gaussian BuckleyJames method presented an oscillating behavior in 9.5% of the generated samples when the convergence rate was 20%. This percentage rose to 82.5% and 95.0% when the convergence rates were 50% and 70%, respectively. For the Gaussian BuckleyJames using the 1step algorithm (Gaussian BJ 1Step), results using the two crossvalidation approaches were almost identical. Nevertheless, for the Gaussian BuckleyJames with complete convergence, a notable improvement was obtained when applying (15).
The higher the rate of censoring, the less information is available to train the models and, unsurprisingly, the higher is the prediction error. For a moderate rate of censoring (20%), all methods show a good performance close to that of the gold standard GoldS. When the rate of censoring is 50%, Gaussian BJ shows the lowest prediction error, followed by Gaussian BJ 1step, NonParBJ and finally simple imputation, which shows more errors. The same patterns but more pronounced were observed with a severe rate of censoring (70%). Taking the knowledge about the distribution into account appears to have only a slight impact. Simple imputation yields the poorest results. In addition, it showed high variability with some extreme errors.
Application to real data
The Lassoregularized BuckleyJames least square algorithm that showed the best behavior in the simulation study (with complete convergence and crossvalidation based on the loss function L_{G} in (15)) was applied to real data, as well as the Lassoregularized nonparametric BuckleyJames method and simple imputation (by LOD and LOD /2).
Regularization parameters were estimated by stratified crossvalidation in order to ensure that each fold had the same proportion of censoring as in the corresponding data set (26%). In addition, because some mutations were relatively infrequent, we used 20fold crossvalidation. Indeed, the higher the number of folds, the lower the probability of randomly obtaining test sets with no subject exposed to infrequent HIV drug resistance mutations.
When applying the Gaussian BuckleyJames method to real data, no oscillating behavior was observed.
Number of HIV genotypic mutations  Gaussian BJ  NonPar BJ  LOD  LOD/2 
present in real data study  
5 in RTG probably associated  3 (60%)  3 (60%)  1 (20%)  1 (20%) 
with abacavir resistance  
13 in RTG probably associated  7 (54%)  4 (31%)  4 (31%)  4 (31%) 
with multiNRTI resistance  
6 in RTG probably associated  3 (50%)  2 (33%)  1 (17%)  1 (17%) 
with NRTI resistance (other than abacavir)  
30 in RTG probably associated  12 (40%)  11 (37%)  8 (27%)  7 (23%) 
with NNRTI resistance  
67 in PG probably associated  40 (60%)  22 (33%)  15 (22%)  14 (21%) 
with PI resistance  
121 Total  65 (54%)  42 (35%)  29 (24%)  27 (22%) 
Discussion
Simple imputation of the detection limit or of half of this limit is an ad hoc approach to address leftcensored outcome data. However, in standard (lowdimensional) settings, it leads to biased estimates of parameters and standard errors. In our highdimensional simulation study, simple imputation using Lassoregularized leastsquares showed poor performance. As in lowdimensional settings, approaches accounting for leftcensoring outperformed simple imputation.
In this work, we propose a Lassoregularized Gaussian BuckleyJames algorithm, according to the usual Gaussian assumption of logtransformed HIV viral load. Because of the wellknown convergence problems of the iterative BuckleyJames procedure, we implemented two algorithms, the first algorithm running until convergence and the second one being stopped after one step [36, 64]. This onestep algorithm showed similar results in the simulation study. Other solutions have been proposed to deal with convergence problems in lowdimensional settings [39, 64] and could be investigated in future research.
As in other works [48], we implemented a crossvalidation criterion for the tuning parameter based on imputing values to Y_{i} in the test set from conditional expectations estimated using the learning set. We also proposed a crossvalidation criterion based on a loss function that accounts for the different contribution of censored and uncensored values. Almost identical results were obtained when applying the two crossvalidation criteria to the onestep algorithm. However, when running the algorithm until convergence, better results were obtained with the crossvalidation criterion based on a loss function that accounts for censored and uncensored contributions.
On the other hand, we reversed the Lassoregularized nonparametric BuckleyJames method previously applied to rightcensored survival data [36, 38, 40, 48] in order to apply to leftcensoring due to detection limits. Foreseeably, in our homoscedastic Gaussian outcome data scenario, the Gaussian BuckleyJames showed better behavior than the nonparametric algorithm. However, accounting for the knowledge about the distribution seems to have had a slight influence. When the Gaussian assumption is violated, nonparametric imputation using Kaplan Meier is perhaps the best option.
We provide a publicly available R code to compute the methods introduced in this work (https://github.com/psBiostat/leftcensoredLasso). It would be interesting to compare the Lassoregularized BuckleyJames least squares method to Lassoregularized censored LAD method. The Lasso extension of censored LAD has been proposed in different works [41, 42, 56, 57, 58, 59, 60, 61, 62]. However, to our knowledge, there is no publicly available implementation, and no simple implementation relying on existing packages seem straightforward. Moreover, several works have shown the superiority of maximum likelihood estimation in lowdimensional settings when the Gaussian assumption is valid [18, 20, 21, 22]. Nevertheless, optimization strategies for complex likelihood functions (such as that in Eq. 7) including penalties that are not smooth are not obvious.
To illustrate the application of the methods on real data, we consider a data set from the Standardization and Clinical Relevance of HIV Drug Resistance Testing Project for the Forum for Collaborative HIV Research. The data set used to illustrate the initial data set is characterized by a sample sizetopredictors ratio of around 1. There is no gold standard to measure and compare predictive performance of the different methods when using censored outcome data. All patients were being treated with abacavir, an NRTI, so we expected our methods to select a high number of HIV genotypic mutations known to contribute to abacavir and NRTI resistance. Furthermore, a high number of patients were on PI and/or NNRTIcontaining regimens, and a selection of several HIV genotypic mutations reported to be probably associated with resistance to any of these molecules was also expected [73, 74]. In that sense, the Gaussian and nonparametric BuckleyJames methods showed more coherent results with the literature compared to simple imputation.
Otherwise, the data presented a moderate censoring rate of 26%, which is a realistic magnitude [18, 50] in studies measuring HIV viral load. However, high or even severe censoring rates were found in older studies with a high limit of detection (LOD) or particular populations with a low treatment failure rate [18, 51, 52]. Furthermore, leftcensoring due to the lower detection limit of an assay is a problem in many fields such as biology, immunology, chemistry, and the environmental sciences in which high censoring rates may be frequent. Our simulation study shows that the difference in performance between Lassoregularized BuckleyJames methods and Lassoregularized simple imputation methods increased with the censoring rate.
In our simulations and real application, the detection threshold was the same for all subjects. The detection threshold may vary among subjects, for example, in multicentric studies. Our R code also supports multiple lower limits of quantification. However, the findings should be interpreted with caution: differences in technological equipment could be a confounding factor that might help explain the differences in patient response to HIV treatment (in addition to HIV mutations). Adjusting or stratifying for the hospital would then be necessary.
In this study we focused on the prediction performance of Lassoregularized methods. In clinical applications, even when prediction accuracy is the main objective, researchers aim to identify which predictors are more strongly associated with outcome. Our proposal could be easily extended or adapted to support other Lassotype penalties. When the primary goal is to infer the set of truly relevant variables, the adaptive Lasso and the bootstrapenhanced Lasso could thus be considered.
Notes
Acknowledgements
We acknowledge members of the Standardization and Clinical Relevance of HIV Drug Resistance Testing Project for the Forum for Collaborative HIV Research. We thank “Sidaction, Ensemble contre le Sida”, France, for their continuous support. We would like to thank Binbin Xu, postdoctoral researcher at SISTM research team from Inserm BPH U1219 & Inria BSO, for his help on testing the R code.
Funding
This work was partially supported by the “Investissements d’Avenir” program managed by the ANR under reference ANR10LABX77
Availability of data and materials
R code and artificial example are available at https://github.com/psBiostat/leftcensoredLasso.
Authors’ contributions
PS developed the algorithms and corresponding R code, carried out the statistical analysis and helped to draft the manuscript. MA developed the algorithms, revised the R code and drafted the manuscript. RT designed and supervised the applied research. RT and LW interpreted the results of the analysis. DC revised the methodology. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
 1.Paxton W, Coombs R, McElrath M, Keefer M, Hughes J, Sinangil F, Chernoff D, Demeter L, B BW, Corey L. Longitudinal analysis of quantitative virologic measures in human immunodeficiency virusinfected subjects with > or = 400 CD4 lymphocytes: implications for applying measurements to individual patients. National Institute of Allergy and Infectious Diseases AIDS Vaccine Evaluation Group. J Infect Dis. 1997; 175(2):247–54.PubMedGoogle Scholar
 2.Helsel DR. More than obvious: Better methods for interpreting nondetect data. Environ Sci Technol. 2005; 39(20):419–23.Google Scholar
 3.Lee M, Kong L, Weissfeld L. Multiple imputation for leftcensored biomarker data based on Gibbs sampling method. Stat Med. 2012; 31:1838–48.PubMedGoogle Scholar
 4.Del Greco M F, Pattaro C, Minelli C, Thompson JR. Bayesian analysis of censored response data in familybased genetic association studies. Biom J. 2016; 58(5):1039–53.PubMedGoogle Scholar
 5.Marschner I, Betensky R, DeGruttola V, Hammer S, Kuritzkes D. Clinical trials using HIV1 RNAbased primary endpoints: Statistical analysis and potential biase. J Acquir Immune Defic Syndr Hum Retrovirol. 1999; 20(3):220–7.PubMedGoogle Scholar
 6.Gillespie BW, Chen Q, Reichert H, Franzblau A, Hedgeman E, Lepkowski J, Adriaens P, Demond A, Luksemburg W, Garabrant DH. Estimating population distributions when some data are below a limit of detection by using a reverse KaplanMeier estimator. Epidemiology. 2010; 21:64–70.Google Scholar
 7.Dinse G, Jusko A, Ho L, Annam K, Graubard B, HertzPicciotto I, Miller F, Gillespie B, Weinberg C. Accomodating measurements below a limit of detection: A novel application of Cox regression. Am J Epidemiol. 2014; 179(8):1018–24.PubMedPubMedCentralGoogle Scholar
 8.Wang HJ, Zhu Z, Zhou J. Quantile regression in partially linear varying coefficient models. Ann Stat. 2009; 37(6B):3841–66.Google Scholar
 9.Eilers PH, Röder E, Savelkoul HF, van Wijk RG. Quantile regression for the statistical analysis of immunological data with many nondetects. BMC Immunol. 2012; 13:13–37.Google Scholar
 10.Powell JL. Least absolute deviations estimation for the censored regression model. J Econ. 1984; 25:303–25.Google Scholar
 11.Powell JL. Censored regression quantiles. J Econom. 1986; 32:143–55.Google Scholar
 12.Tobin J. Estimation of relationships for limited dependent variables. Econometrica. 1958; 26:24–36.Google Scholar
 13.Hughes JP. Mixed effects models with censored data with application to HIV RNA levels. Biometrics. 1999; 55:625–9.PubMedGoogle Scholar
 14.JacqminGadda H, Thiébaut R, Chêne G, Commenges D. Analysis of leftcensored longitudinal data with application to viral load in HIV infection. Biostatistics. 2000; 1(4):355–68.PubMedGoogle Scholar
 15.Lynn HS. Maximum likelihood inference for leftcensored HIV RNA data. Stat Med. 2001; 20:33–45.PubMedGoogle Scholar
 16.Nie L, Chu H, Liu C, Cole SR, Vexler A, Schisterman EF. Linear regression with an independent variable subject to a detection limit. Epidemiology. 2010; 21:17–24.Google Scholar
 17.Fu P, Hughes J, Zeng G, Hanook S, Orem J, Mwanda O, Remick S. A comparative investigation of methods for longitudinal data with limits of detection through a case study. Stat Methods Med Res. 2016; 25(1):153–66.PubMedGoogle Scholar
 18.Wiegand RE, Rose CE, Karon JM. Comparison of models for analyzing twogroup, crosssectional data with a gaussian outcome subject to a detection limit. Stat Methods Med Res. 2016; 25(6):2733–49.PubMedGoogle Scholar
 19.Buckley J, James I. Linear regression with censored data. Biometrika. 1979; 66:429–36.Google Scholar
 20.Hewett P, Ganser GH. A comparison of several methods for analyzing censored data. Ann Occup Hyg. 2007; 51:611–32.PubMedGoogle Scholar
 21.Uh HW, Hartgers FC, Yazdanbakhsh M, HouwingDuistermaat JJ. Evaluation of regression methods when immunological measurements are constrained by detection limits. BMC Immunol. 2008; 9(1):59.PubMedPubMedCentralGoogle Scholar
 22.Kafatos G, Andrews N, McConway KJ, Farrington P. Regression models for censored serological data. J Med Microbiol. 2013; 62(Pt 1):93–100.PubMedGoogle Scholar
 23.Hirsch MS, Günthard HF, Schapiro JM, Vézinet FB, Clotet B, Hammer SM, Johnson VA, Kuritzkes DR, Mellors JW, Pillay D, et al. Antiretroviral drug resistance testing in adult HIV1 infection: 2008 recommendations of an International AIDS SocietyUSA panel. Clin Infect Dis. 2008; 47(2):266–85.PubMedGoogle Scholar
 24.Wittkop L, Günthard H, de Wolf F, Dunn D, CozziLepri A, de Luca A, Kücherer C, Obel N, von Wyl V, Masquelier B, Stephan C, Torti C, Antinori A, Garcia F, Judd A, Porter K, Thiébaut R, Castro H, van Sighem A, Colin C, Kjaer J, Lundgren J, Paredes R, Pozniak A, Clotet B, philipps A, Pillay D, Chêne G, study group EC. Effects of transmitted drug resistance on virological and immunological response to initial combination antiretroviral therapy for HIV (eurocoordchain joint project): a european multicohort study. Lancet Infect Dis. 2011; 11(5):363–71.PubMedGoogle Scholar
 25.Hofstra LM, Sauvageot N, Albert J, Alexiev I, Garcia F, Struck D, Van de Vijver DA, Åsjö B, Beshkov D, Coughlan S, et al. Transmission of HIV drug resistance and the predicted effect on current firstline regimens in europe. Clin Infect Dis. 2016; 62(5):655–63.PubMedGoogle Scholar
 26.Wensing AM, Calvez V, Günthard HF, Johnson VA, Paredes R, Pillay D, Shafer RW, Richman DD. 2017 update of the drug resistance mutations in HIV1. Top Antivir Med. 2017; 24(4):132.PubMedGoogle Scholar
 27.Rabinowitz M, Myers L, Banjevic M, Chan A, SweetkindSinger J, Haberer J, McCann K, Wolkowicz R. Accurate prediction of HIV1 drug response from the reverse transcriptase and protease amino acid sequences using sparse models created by convex optimization. Bioinformatics. 2006; 22(5):541–9.PubMedGoogle Scholar
 28.Beerenwinkel N, Montazeri H, Schuhmacher H, Knupfer P, von Wyl V, Furrer H, Battegay M, Hirschel B, Cavassini M, Vernazza P, Bernasconi E, Yerly S, Böni J, Klimkait T, Cellerai C, Günthard HF, Study TSHC. The individualized genetic barrier predicts treatment response in a large cohort of HIV1 infected patients. PLoS Comput Biol. 2013; 9(8):1–11.Google Scholar
 29.CozziLepri A, Prosperi MCF, Kjær J, Dunn D, Paredes R, Sabin CA, Lundgren JD, Phillips AN, Pillay D, for the EuroSIDA, the United Kingdom CHIC/United Kingdom HDRD Studies. Can linear regression modeling help clinicians in the interpretation of genotypic resistance data? an application to derive a lopinavirscore. PLoS ONE. 2011; 6(11):1–9.Google Scholar
 30.Wittkop L, Commenges D, Pellegrin I, Breilh D, Neau D, Lacoste D, Pellegrin JL, Chêne G, Dabis F, Thiébaut R. Alternative methods to analyse the impact of HIV mutations on virological response to antiviral therapy. BMC Med Res Methodol. 2008; 8(1):68.PubMedPubMedCentralGoogle Scholar
 31.Assoumou L, Houssaïna A, Corstagliola D, Flandre P, Standardization and clinical relevance of HIV drug resistance testing project from the forum for collaborative HIV research. Relative contributions of baseline patient characteristics and the choice of statistical methods to the variability of genotypic resistance scores: the example of didanosine. J Antimicrop Chemother. 2010; 65(4):752–60.PubMedGoogle Scholar
 32.Rhee S, Taylor J, Wadhera G, BenHur A, Brutlag D, Shafer R. Genotypic predictos of human immunodeficiency cirus type 1 drug resistance. Proc Natl Acad Sci USA. 2006; 103(46):17355–60.PubMedGoogle Scholar
 33.Tibshirani R. The Lasso method for variable selection in the Cox model. Stat Med. 1997; 16:385–95.PubMedGoogle Scholar
 34.Huang J, Ma S, Xie H. Regularized estimation in the accelerated failure time model with highdimensional covariates. Biometrics. 2006; 62:813–20.PubMedGoogle Scholar
 35.Datta S, LeRademacher J, Datta S. Predicting patient survival from microarray data by accelerated failure time modeling using partial least squares and Lasso. Biometrics. 2007; 63:259–71.PubMedGoogle Scholar
 36.Johnson BA. Variable selection in semiparametric linear regression with censored data. J R Stat Soc Ser B Stat Methodol. 2008; 70:351–70.Google Scholar
 37.Wang S, Nan B, Zhu J, Beer DG. Doubly penalized BuckleyJames method for survival data with highdimensional covariates. Biometrics. 2008; 64(1):132–40.PubMedGoogle Scholar
 38.Cai T, Huang J, Tian L. Regularized estimation for the accelerated failure time model. Biometrics. 2009; 65:394–404.PubMedPubMedCentralGoogle Scholar
 39.Ueki M. A note on automatic variable selection using smooththreshold estimating equations. Biometrika. 2009; 96(4):1005–11.Google Scholar
 40.Wang Z, Wang CY. Buckleyjames boosting for survival analysis with highdimensional biomarker data. Stat Appl Genet Mol Biol. 2010; 9(1):24.PubMedCentralGoogle Scholar
 41.Shows JH, Lu W, Zhang HH. Sparse estimation and inference for censored median regression. J Stat Plan Infer. 2010; 140:1903–17.Google Scholar
 42.Wang HJ, Zhou J, Li Y. Variable selection for censored quantile regression. Stat Sin. 2013; 23(1):145–67.PubMedPubMedCentralGoogle Scholar
 43.Chung M, Long Q, Johnson BA. A tutorial on rankbased coefficient estimation for censored data in smalland largescale problems. Stat Comput. 2013; 23(5):601–14.PubMedGoogle Scholar
 44.Huang X, Pan W, Park S, Han X, Miller LW, Hall J. Modeling the relationship between LVAD support time and gene expression changes in the human heart by penalized partial least squares. Bioinformatics. 2004; 20(6):888–94.PubMedGoogle Scholar
 45.Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008; 2:841–60.Google Scholar
 46.Wang Y, Chen T, Zeng D. Support vector hazards machine: A counting process framework for learning risk scores for censored outcomes. J Mach Learn Res. 2016; 17(167):1–37.PubMedPubMedCentralGoogle Scholar
 47.Van der Burgh HK, Schmidt R, Westeneng HJ, de Reus MA, van den Berg LH, van den Heuvel MP. Deep learning predictions of survival based on MRI in amyotrophic lateral sclerosis. NeuroImage Clin. 2017; 13:361–9.PubMedGoogle Scholar
 48.Johnson BA. On lasso for censored data. Electron J Stat. 2009; 3:485–506.Google Scholar
 49.CozziLepri A. Initiatives for developing and comparing genotype interpretation systems: external validation of existing rulebased interpretation systems for abacavir against virological response. HIV Med. 2008; 9(1):27–40.PubMedGoogle Scholar
 50.Marks G, Gardner LI, Craw J, Giordano TP, Mugavero MJ, Keruly JC, Wilson TE, Metsch LR, Drainoni ML, Malitz F. The spectrum of engagement in HIV care: do more than 19% of HIVinfected persons in the US have undetectable viral load?. Clin Infect Dis. 2011; 53(11):1168–9.PubMedGoogle Scholar
 51.Dao CN, Patel P, Overton ET, Rhame F, Pals SL, Johnson C, Bush T, Brooks JT, Study to Understand the Natural History of HIV and AIDS in the Era of Effective Therapy (SUN) Investigators. Low vitamin D among HIVinfected adults: prevalence of and risk factors for low vitamin D levels in a cohort of HIVinfected adults and comparison to prevalence among adults in the US general population. Clin Infect Dis. 2011; 52(3):396–405.PubMedGoogle Scholar
 52.Leon A, Perez I, RuizMateos E, Benito JM, Leal M, LopezGalindez C, Rallon N, Alcami J, LopezAldeguer J, Viciana P, et al. Rate and predictors of progression in elite and viremic HIV1 controllers. AIDS. 2016; 30(8):1209–20.PubMedGoogle Scholar
 53.Tibshirani R. Regression shrinkage and selection via the Lasso. J R Stat Soc Ser B Methodol. 1996; 58(1):267–88.Google Scholar
 54.Sigrist F, Stahel WA. Using the censored gamma distribution for modeling fractional response variables with an application to loss given default. ASTIN Bull J Int Actuar Assoc. 2011; 41(02):673–710.Google Scholar
 55.Belloni A, Chernozhukov V. L1penalized quantile regression in highdimensional sparse models. Ann Stat. 2011; 39(1):82–130.Google Scholar
 56.Xue X, Xie X, Strickler HD. A censored quantile regression approach for the analysis of time to event data. Stat Methods Med Res. 2018; 27(3):955–65.PubMedGoogle Scholar
 57.Zhanfeng W, Yaohua W, Lincheng Z. A lassotype approach to variable selection and estimation for censored regression model. Chin J Appl Probab Stat. 2010; 26(1):66–80.Google Scholar
 58.Yue YR, Hong HG. Bayesian tobit quantile regression model for medical expenditure panel survey data. Stat Model. 2012; 12(4):323–46.Google Scholar
 59.Liu X, Wang Z, Wu Y. Group variable selection and estimation in the tobit censored response model. Comput Stat Data Anal. 2013; 60:80–9.Google Scholar
 60.Zhou X, Liu G. LADlasso variable selection for doubly censored median regression models. Commun Stat Theory Methods. 2013; 45(12):3658–67.Google Scholar
 61.Alhamzawi R. Bayesian elastic net tobit quantile regression. Commun Stat Simul Comput. 2016; 45(7):2409–27.Google Scholar
 62.Müller P, van de Geer S. Censored linear model in high dimensions. TEST. 2015; 25(1):75–92.Google Scholar
 63.Peter Wu CS, Zubovic Y. A largescale monte carlo study of the BuckleyJames estimator with censored data. J Stat Comput Simul. 1995; 51(24):97–119.Google Scholar
 64.Wang YG, Zhao Y, Fu L. The Buckley–James estimator and induced smoothing. Aust N Z J Stat. 2016; 58(2):211–25.Google Scholar
 65.Gleit A. Estimation for small normal data sets with detection limits. Environ Sci Technol. 1985; 19(12):1201–6.PubMedGoogle Scholar
 66.Johnson BA, Long Q, Chung M. On path restoration for censored outcomes. Biometrics. 2011; 67:1379–88.PubMedPubMedCentralGoogle Scholar
 67.Zhao SD, Lee D, Li Y. The Dantzig selector for censored linear regression models. Stat Sin. 2014; 24(1):251–68.PubMedPubMedCentralGoogle Scholar
 68.DiRienzo AG. Parsimonious covariate selection with censored outcomes. Biometrics. 2016; 72:452–62.PubMedGoogle Scholar
 69.R Core Team. R: A language and environment for statistical computing. Vienna: R foundation for statistical computing; 2017. ISBN 3900051070, http://www.Rproject.org.Google Scholar
 70.Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010; 33(1):1–22.PubMedPubMedCentralGoogle Scholar
 71.Wang Z, Wang MZ, Suggests T. bujar: BuckleyJames regression for survival data with highdimensional covariates. 2015. R package version 0.21. https://CRAN.Rproject.org/package=bujar.
 72.Iyidogan P, Anderson KS. Current perspectives on HIV1 antiretroviral drug resistance. Viruses. 2014; 6(10):4095–139.PubMedPubMedCentralGoogle Scholar
 73.Shafer RW, Schapiro JM. HIV1 drug resistance mutations: an updated framework for the second decade of HAART. AIDS Rev. 2008; 10(2):67.PubMedPubMedCentralGoogle Scholar
 74.Johnson VA, BrunVézinet F, Clotet B, Gunthard H, Kuritzkes DR, Pillay D, Schapiro JM, Richman DD. Update of the drug resistance mutations in HIV1: December 2009. Top HIV Med. 2009; 17(5):138–45.PubMedGoogle Scholar
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.