Risk Factors Associated with HIV Acquisition in Males Participating in HIV Vaccine Efficacy Trials in South Africa

In South Africa, HIV acquisition risk has been studied less in people assigned male at birth. We studied the associations between risk behaviors, clinical features and HIV incidence amongst males in two South African HIV preventive vaccine efficacy trials. We used Cox proportional hazards models to test for associations between demographics, sexual behaviors, clinical variables and HIV acquisition among males followed in the HVTN 503 (n = 219) and HVTN 702 (n = 1611) trials. Most males reported no male sexual partners (99.09% in HVTN 503) or identified as heterosexual (88.08% in HVTN 702). Annual HIV incidence was 1.39% in HVTN 503 (95% CI 0.76–2.32%) and 1.33% in HVTN 702 (95% CI 0.80–2.07%). Increased HIV acquisition was significantly associated with anal sex (HR 6.32, 95% CI 3.44–11.62), transactional sex (HR 3.42, 95% CI 1.80–6.50), and non-heterosexual identity (HR 16.23, 95% CI 8.13–32.41) in univariate analyses, and with non-heterosexual identity (HR 14.99, 95% CI 4.99–45.04; p < 0.01) in multivariate analysis. It is appropriate that prevention efforts in South Africa, although focused on the severe epidemic in young women, also encompass key male populations, including men who have sex with men and men who engage in anal or transactional sex.

Supplementary Information: The online version contains supplementary material available at 10.1007/s10461-023-04025-z.


A literature review was conducted to assess prior evidence of HIV risk factors in African men. If a risk factor was found to be a statistically significant predictor of HIV in a multivariate model in at least two papers, it was added to the list of published risk factors. Published risk factors that were measured in HVTN 702 and HVTN 503 in some form (although not necessarily over the same time period or with the same categories) were included in the pre-specified multivariate model.
The table below lists the variables in the pre-specified multivariate model, and the papers that identified these as predictors of HIV in African men.

Details of imputation procedure
Missing baseline variables were imputed using the R package 'mice'. The package imputes categorical variables using polytomous regression, binary variables using logistic regression, and continuous variables using predictive mean matching. The entire set of baseline variables (Tables 1 and 2) and HIV outcomes was used as the basis for the imputation. A total of 100 imputed datasets were generated and results were combined across imputed datasets using Rubin's rules.
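The pooling step can be illustrated with a minimal Python sketch of Rubin's rules for a single scalar parameter. This is not the analysis code (the analysis used R); the function name `pool_rubin` and the example numbers are hypothetical, and the sketch assumes each imputed dataset yields one point estimate and its squared standard error, for example a log hazard ratio.

```python
import math


def pool_rubin(estimates, variances):
    """Pool one scalar parameter across m imputed datasets (Rubin's rules).

    estimates: per-dataset point estimates (e.g. log hazard ratios).
    variances: per-dataset squared standard errors of those estimates.
    Returns (pooled_estimate, pooled_standard_error).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need matching inputs from at least two imputations")
    q_bar = sum(estimates) / m  # pooled point estimate
    u_bar = sum(variances) / m  # within-imputation variance
    # Between-imputation variance of the point estimates
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = u_bar + (1 + 1 / m) * b  # total variance
    return q_bar, math.sqrt(total)


# Hypothetical estimates from three imputed datasets
pooled, se = pool_rubin([1.0, 1.2, 1.4], [0.04, 0.05, 0.06])
```

The total variance adds the between-imputation spread, inflated by 1 + 1/m, to the average within-imputation variance, so uncertainty due to the missing data is carried into the pooled standard error.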

Super-learner methods and supplementary results
The same variables considered in the regression models were considered as covariates in the nonparametric ensemble-based cross-validated learning, also known as Super-learning, and used to build an HIV risk score. The risk score is defined as the logit of the predicted HIV infection probability from a regression model estimated using the ensemble algorithm Superlearner, where this logit predicted outcome is scaled to have empirical mean zero and empirical standard deviation one.
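The risk score definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `risk_scores` is hypothetical, the predicted probabilities are taken as given, probabilities are clipped away from 0 and 1 so the logit stays finite, and the sample (n − 1) standard deviation is used because the text does not say which divisor was applied.

```python
import math
import statistics


def risk_scores(probabilities, eps=1e-6):
    """Standardized logit of predicted infection probabilities.

    eps is an assumed clipping bound that keeps the logit finite.
    """
    clipped = [min(max(p, eps), 1 - eps) for p in probabilities]
    logits = [math.log(p / (1 - p)) for p in clipped]
    mean = statistics.fmean(logits)
    sd = statistics.stdev(logits)  # assumption: n - 1 divisor
    if sd == 0:
        raise ValueError("all predicted probabilities are identical")
    return [(x - mean) / sd for x in logits]


# Hypothetical predicted probabilities for four participants
scores = risk_scores([0.01, 0.02, 0.05, 0.10])
```

Because the logit is monotone, the ranking of participants by risk score matches their ranking by predicted probability.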
Super-learning was implemented on each of the 100 imputed datasets. Seven different learners were included in the learner library: a mean model (no predictors), logistic regression, logistic regression with all two-way interactions between variables, logistic regression with lasso penalty implemented using glmnet, logistic generalized additive model implemented using gam, boosted logistic regression implemented using xgboost, and random forest implemented using ranger. All of the selected learners are coded into the SuperLearner R package available on CRAN. The learners all model the HIV outcome as binary and treat censored subjects as HIV-uninfected; this simplification is reasonable given the low HIV incidence, and binary outcome and censored data methods have been seen to produce similar results in other analyses of these data.
The learners were implemented with different approaches to variable pre-screening: all variables eligible for inclusion, including variables with non-zero coefficients in a lasso fit, including variables with univariate Wald test 2-sided p-values in logistic regression < 0.10, and selecting only one variable at random from amongst a pair of quantitative variables with pairwise Spearman rank correlation > 0.90. Supplementary Table S4 lists the learner-screen combinations that were considered (14 in total).
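As an illustration of the correlation-based screen, the following Python sketch drops one variable at random from any pair whose Spearman rank correlation exceeds 0.90. It is not the SuperLearner screening code; the function names, the seeded random generator, and the example columns are all hypothetical, and the threshold is applied to the signed correlation exactly as worded above.

```python
import random


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_screen(columns, threshold=0.90, rng=None):
    """Return the variable names kept after the pairwise screen."""
    rng = rng or random.Random(0)  # seeded for reproducibility (assumption)
    names = list(columns)
    kept = list(names)
    for idx, a in enumerate(names):
        for b in names[idx + 1:]:
            both_kept = a in kept and b in kept
            if both_kept and spearman(columns[a], columns[b]) > threshold:
                kept.remove(rng.choice([a, b]))
    return kept


# Hypothetical quantitative variables; the first two are perfectly correlated
columns = {
    "age_years": [20, 25, 30, 35, 40],
    "age_months": [240, 300, 360, 420, 480],
    "partners": [3, 1, 4, 1, 5],
}
kept = correlation_screen(columns)
```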
For each of the 100 imputed datasets, Superlearner was implemented after pre-scaling each quantitative and ordinal variable to have mean 0 and standard deviation 1. Two levels of cross-validation were used: 1) Outer level: a cross-validated AUC (CV-AUC) was computed over 5-fold cross-validation, and 2) Inner level: 5-fold CV was used to estimate weights associated with each learner in the ensemble. Results were summarized across the 100 imputed datasets using mean, median, and standard deviation.
The weights associated with each constituent learner are reported in Supplementary Table S5. The coefficients of each of the variables in each constituent learner are in Supplementary Table S6.
Classification accuracy of different models was measured using CV-AUC (Hubbard et al., 2016; Williamson et al., 2020) as estimated using the R package vimp available on CRAN. CV-AUC values for constituent learners and the Super-learner model are in Supplementary Table S7.
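For readers unfamiliar with the metric, a simplified Python sketch of a cross-validated AUC follows. It computes the empirical AUC on each held-out fold as the proportion of case/non-case pairs that are ranked correctly (ties count one half) and averages over folds. This is an assumption-laden simplification, not the estimator implemented in vimp; the function names and example values are hypothetical.

```python
def auc(labels, scores):
    """Empirical AUC: share of (case, non-case) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def cv_auc(folds):
    """Average the held-out AUC over folds.

    folds: list of (labels, scores) pairs, where the scores come from a
    model fitted without that fold's observations.
    """
    return sum(auc(y, s) for y, s in folds) / len(folds)


# Hypothetical held-out predictions from two folds
example = cv_auc([
    ([0, 1], [0.2, 0.9]),
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
])
```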
To estimate the predictive ability of Superlearner on out-of-sample test data, Super-learning was also implemented on each of the 100 imputed datasets by splitting them randomly into 2:1 train:test sets. Each split was stratified by HIV infection status to ensure 2:1 representation of HIV cases in all train:test sets. The Superlearner model and all constituent learners developed using the training set were subsequently used to predict outcome probability on the test set, and AUC was used to measure performance. Median AUC of the risk score across imputed test sets was 0.688 [95% CI: 0.555–0.778], which was comparable to CV-AUC. This suggested that the CV-AUC is a good estimate of out-of-sample performance.

Table S6: Summary statistics (N, mean, median, standard error, and 95% CI) of the odds ratio of predictors in learners assigned weight > 0.0 by Superlearner in any of the 100 imputed datasets. Random forest and xgboost results reported separately. N learner indicates number of datasets for which the weight was non-zero for the particular constituent learner. N predictor indicates number of datasets for which the weight was non-zero and the predictor was also given a non-zero estimate. Confidence intervals based on 2.5 and 97.5 quantiles.

Table S7: Summary statistics (mean, median, standard error, and 95% CI) of CV-AUCs illustrating performance of Superlearner and all learner-screen combinations from the imputed datasets (N=100) for risk score analyses using the full dataset and HIV-1 status as outcome. Confidence intervals based on 2.5 and 97.5 quantiles from the CV-AUCs from the 100 datasets.
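The stratified 2:1 train:test split used for the out-of-sample check can be sketched as follows. This Python example is illustrative only: the function name `stratified_split`, the fixed seed, and the example labels are hypothetical, and rounding each stratum to the nearest whole participant is an assumption.

```python
import random


def stratified_split(labels, train_fraction=2 / 3, seed=0):
    """Split indices into train and test sets within each outcome class.

    Shuffling and cutting separately per class keeps the share of HIV
    cases close to the same in both sets.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * train_fraction)  # assumed rounding rule
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)


# Hypothetical outcome vector: 6 cases among 36 participants
labels = [1] * 6 + [0] * 30
train_idx, test_idx = stratified_split(labels)
```

With these example labels, four of the six cases fall in the training set and two in the test set, mirroring the 2:1 allocation of the non-cases.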