The cluster of differentiation 4 (CD4+) count is the most common indicator of the health status and immune function of patients infected with the human immunodeficiency virus (HIV) [1]. Several CD4+ count covariates from different clinical platforms have been investigated in HIV-positive patients. Interest in the behavioural patterns of these covariates has had different motivations, ranging from their potential use as cost-effective CD4+ count surrogates [2,3,4] or predictors [5,6,7] to pre-treatment assessment and monitoring of therapy in HIV-positive patients [8]. Such endeavours to keep abreast of the health status of HIV-positive patients in the absence of the CD4+ count were triggered by the high costs of CD4+ count diagnostic devices in the past [9, 10], which made them not easily accessible in resource-limited settings in the developing world [11], where health facilities are usually overburdened [12]. The challenge was exacerbated by operational and logistical issues [13, 14] in the supply of essential medicine for the patients [15,16,17], including frequent instrument breakdown and poor manufacturer maintenance of CD4+ count diagnostics [18]. Recently, obtaining the CD4+ count has become extraordinarily inexpensive [19,20,21] and, in the contemporary era of antiretroviral therapy (ART), monitoring patients’ CD4+ counts and restoring them to acceptable levels are now relatively easy [22] and have led to improved patient survival periods [23]. Despite the breakthrough in ART, recommendations have been made to identify other factors that influence long-term CD4+ cell response in conjunction with the therapy [24]. Previous studies, however, have been inclined more towards social, demographic and other categorical factors [25,26,27], which suffer from information loss due to their grouping nature [28, 29].
On the other hand, the “richer” continuous clinical covariates are more sensitive to sources of variation [30] in the CD4+ count and better able to capture and explain realistic behavioural patterns of the CD4+ cell response in the face of the rapidly mutating [31] HIV that is known to attack the CD4+ cells [32]. The CD4+ count surrogates, predictors and pre-treatment assessment options suggested in the past turned out to have the potential to be manipulated as drivers of long-term CD4+ cell response in HIV-positive patients. For example, a close follow-up on sodium has been reported to improve outcomes [33] as it positively influences the CD4+ count, and its early management was found to be a contributing factor to the survival rates of HIV-positive patients [34]. Among other CD4+ count clinical covariates, sodium and calcium levels are affected by dietary conditions [35, 36], which can potentially become an integral part of the HIV treatment process during disease progression. Other blood chemistry components have also been suggested [5, 7, 33, 37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52], as have CD4+ count covariates from other clinical platforms such as the full blood count [3, 5, 53,54,55,56,57], lipids [58,59,60], sugar [61,62,63] and clinical examination measurements [2, 6, 64,65,66,67,68,69,70,71]. It then stands to reason that endeavours to deal with the obstacle of expensive CD4+ count diagnostic devices in the past left a long trail of suggested continuous CD4+ count clinical covariates that have the potential to be an integral part of the treatment process during HIV disease progression. The list of such potentially manageable continuous CD4+ count clinical covariates has also grown over the past few years owing to the high volume of patient electronic health records that are being stored faster and more cheaply than in the past [72].
However, an evaluation to determine the strongest candidates of these continuous CD4+ count clinical covariates during HIV disease progression has not been well documented.

This bioinformatics study aimed to pool and evaluate the previously and independently suggested continuous CD4+ count clinical covariates to give insight into the strongest drivers of the long-term CD4+ cell response in HIV-positive patients during disease progression. Our goal was to shed more light on the possibilities of integrating and managing the continuous CD4+ count clinical covariates in the HIV treatment process. For example, ART is a major milestone in HIV treatment [73], but it is associated with side effects that can lead patients into challenging situations [74, 75]. Hence, managing the influence of the continuous clinical covariates on the CD4+ cell response could potentially prolong the pre-treatment period and make it more likely that patients are spared ART-related issues at an early stage of disease progression. Some of the statistical tools previously used to assess associations between the CD4+ count and its covariates were either limited or suffered from information loss, for example, analysis of variance (ANOVA) [51, 76, 77], confidence intervals [64], t-tests [58,59,60], non-parametric tests [33, 38], chi-square tests [61, 78], linear regression [65, 79], sensitivity, specificity and positive prediction [2, 8] and correlation analysis [63, 66, 80]. As such, we also sought to pave the way for other areas such as predictive modelling with a streamlined set of influential clinical covariates whose continuous nature preserves more of the information needed to explain the CD4+ count variation. We evaluated available measurements of the continuous CD4+ count clinical covariates routinely collected at the Centre for the AIDS Programme of Research in South Africa (CAPRISA).


The Study Design

CAPRISA 002 enrolled 245 HIV-negative (phase 1: pre-HIV infection) female sex workers into an acute infection study. The establishment of the acute infection study, cohort screening and seroconverts, routine evaluation procedures, CAPRISA-participant interaction and data management have been previously documented [81]. The study protocol and informed consent documents were reviewed and approved by the local ethics committees of the University of KwaZulu-Natal, the University of Cape Town, the University of the Witwatersrand in Johannesburg and the Prevention Sciences Review Committee (PSRC) of the Division of AIDS (DAIDS, National Institutes of Health, USA). The study was also performed in accordance with the Helsinki Declaration of 1964 and its later amendments. The consent forms were translated into the vernacular language, isiZulu, and written informed consent was obtained at each stage of the study. All minors under the age of 18 years were excluded from the study as part of the screening procedure. The HIV-negative cohort was followed up and, upon HIV infection, participants were further followed up with weekly to fortnightly visits up to 3 months (phase 2: acute infection), monthly visits from 3–12 months (phase 3: early infection) and quarterly visits thereafter (phase 4: established infection) until ART initiation (phase 5). Eventually, 27 seroconversions were recorded. In addition to these 27 seroconverts, 210 more patients who seroconverted in other CAPRISA studies were also enrolled and similarly followed up post infection from the acute to the ART phase. Figure 1 summarises how the total sample size of 237 seroconverts for this study was obtained.

Fig. 1

Study design. The HIV-negative cohort screening involved 775 voluntary potential candidates, of whom 462 were already HIV positive and 313 were initially eligible. Of the 313 HIV-negative candidates, only 245 were enrolled; the rest were excluded for various reasons according to the eligibility criteria. Eventually, 27 of the 245 seroconverted and were enrolled into follow-up care. Seroconverts from other CAPRISA studies (210) were also included in the follow-up care, giving a total of 237 patients for this study.


The baseline (phase 1) repeated measurements were scarce; hence, this study focused on phases 2 to 5 only. Four time points prior to each phase transition were selected, which resulted in a total of 16 repeated measurements (4 time points × 4 phases) being investigated for each patient. The CD4+ count covariates were drawn from the following clinical platforms: full blood count, lipids, sugar, blood chemistry and clinical examination. Several of these variables have been studied as potential covariates of the CD4+ count, but mostly in isolation or within small groups of fewer than five variables confined to their respective clinical platforms. All the data sets from the different clinical platforms were pooled into a single data set.

Statistical Analysis

All analyses were performed in the open-source R software, version 3.5.0. Firstly, a descriptive summary of the repeated measurements was produced using the stat.desc function in the pastecs library. Secondly, redundant features among the covariates were investigated using correlation analysis: within each highly correlated pair, the findCorrelation function flagged the covariate with the larger mean absolute correlation for removal. Finally, a partial least squares (PLS) model was built with the spls function in the mixOmics library, which is capable of handling the complex structure of repeated measurements. The package incorporates a design matrix to account for variation in the multilevel structure of the longitudinal data. PLS handles multicollinearity and a very large number of variables in longitudinal data. It ranks the covariates from strongest to weakest, allowing variable selection and consequently dimension reduction. Since PLS is a multidimensional analysis technique, graphical displays of the results were vital for visualising the variable selection process; these were produced with the R libraries ggplot2 and ggrepel.
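As a rough illustration of how PLS ranks covariates, the sketch below computes variable importance in projection (VIP) scores for a one-component PLS regression with a single response. It is written in plain Python rather than R and is not the mixOmics spls implementation used in this study: it applies no sparsity and no multilevel design matrix, and the function names and toy data are hypothetical. With one component, the VIP of covariate j reduces to sqrt(p) × |w_j|, where w is the unit-length weight vector proportional to X'y on standardised data and p is the number of covariates.

```python
from math import sqrt


def standardise(values):
    """Centre a column to mean 0 and scale it to unit sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


def pls1_vip_one_component(covariates, response):
    """Return {covariate name: VIP} for a single-component, single-response PLS.

    covariates: dict mapping a name to a list of measurements
    response:   list of response values (same length as each covariate)
    """
    y = standardise(response)
    x = {name: standardise(col) for name, col in covariates.items()}
    # First PLS weight vector is proportional to X'y on standardised data.
    xty = {name: sum(a * b for a, b in zip(col, y)) for name, col in x.items()}
    norm = sqrt(sum(v * v for v in xty.values()))
    p = len(covariates)
    # With one component the explained-variance terms cancel out of the VIP.
    return {name: sqrt(p) * abs(v) / norm for name, v in xty.items()}


# Hypothetical toy data, not study measurements.
toy_response = [1, 2, 3, 4, 5]
toy_covariates = {
    "covariate_a": [1, 2, 3, 4, 6],  # tracks the response closely
    "covariate_b": [3, 1, 4, 1, 5],  # tracks the response weakly
}
vip = pls1_vip_one_component(toy_covariates, toy_response)
for name, score in sorted(vip.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: VIP = {score:.3f}")
```

The squared VIP scores sum to the number of covariates, so a score above 1 marks a covariate that contributes more than the average covariate to the component.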


Descriptive Statistics

Table 1 shows that throughout the follow-up care, the minimum and maximum CD4+ counts recorded were 45 and 1395 cells/mm³, respectively. During the follow-up period, at least 50% of the repeated CD4+ count measurements were above 539 cells/mm³, and the mean was 571.14 ± 238.45 cells/mm³, an overall variation (coefficient of variation) of 41.75% around the cohort average. The greatest variation among the covariates was observed in eosinophils (101.11%), basophils (75.74%) and gamma glutamyl transferase (64.20%).

Table 1 A descriptive summary of the investigated variables
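The overall variation reported for the CD4+ count is consistent with a coefficient of variation, that is, the standard deviation expressed as a percentage of the mean. The short Python check below reproduces the figure from the reported mean and standard deviation; the function name is ours, and reading the reported percentage as a coefficient of variation is an assumption based on this arithmetic.

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation as a percentage of the mean."""
    return 100.0 * sd / mean


# Mean and standard deviation of the CD4+ count as reported in the text.
cd4_cv = coefficient_of_variation(mean=571.14, sd=238.45)
print(f"CD4+ count coefficient of variation: {cd4_cv:.2f}%")  # 41.75%
```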

Redundant Feature Selection

Table 2 shows that haemoglobin (Hb), mean corpuscular haemoglobin (MCH), leucocytes, cholesterol, hip circumference, weight (kg) and body mass index (BMI) were highly correlated with the other covariates. The anthropometric measurements were the most highly correlated among themselves. Although the BMI was marked as a redundant feature, it was nonetheless retained on intuitive grounds for the second stage of variable reduction using the PLS.

Table 2 Redundant features: highly correlated (r > 0.75) covariates of the CD4+ count
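The redundancy filter can be sketched as follows. This is a simplified Python illustration of the idea described in the statistical analysis (for each pair of covariates with an absolute correlation above 0.75, drop the one with the larger mean absolute correlation with the remaining covariates). It is not the findCorrelation function itself, whose exact procedure may differ in detail, and the function names, variable names and toy values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def drop_redundant(data, cutoff=0.75):
    """Return the sorted names of covariates flagged as redundant.

    data: dict mapping a covariate name to its list of measurements
    """
    names = list(data)
    abs_r = {a: {b: abs(pearson(data[a], data[b])) for b in names if b != a}
             for a in names}
    dropped = set()
    while True:
        kept = [n for n in names if n not in dropped]
        # Pairs of retained covariates still correlated above the cut-off.
        pairs = [(a, b) for i, a in enumerate(kept) for b in kept[i + 1:]
                 if abs_r[a][b] > cutoff]
        if not pairs:
            return sorted(dropped)
        # Handle the most strongly correlated pair first.
        a, b = max(pairs, key=lambda pair: abs_r[pair[0]][pair[1]])

        def mean_abs(v):
            return sum(abs_r[v][m] for m in kept if m != v) / (len(kept) - 1)

        # Drop the member of the pair that overlaps most with everything else.
        dropped.add(a if mean_abs(a) >= mean_abs(b) else b)


# Arbitrary toy values, not study measurements.
toy = {
    "weight_kg": [1, 2, 3, 4, 5],
    "hip_cm": [2, 4, 6, 8, 11],
    "sodium": [3, 1, 4, 1, 5],
}
redundant = drop_redundant(toy)
print(redundant)
```

In this toy example the two anthropometric columns are almost perfectly correlated, so one of them is flagged while the weakly correlated third column is kept.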

Variable Selection

The optimal principal component (Fig S1) explained 68.95% of the variance in the response (CD4+ count), and the variable selection simultaneously considered both the variable importance in projection (VIP) and the regression coefficients (see Fig S2 for details). We present all three VIP cut-off points: 1.5 can be considered a strict selection, 1.0 moderate and 0.8 lenient. Of the 40 non-redundant features available for our study, the strict cut-off selected 2 covariates, the moderate 13 and the lenient 18. We took an interest in all 18 of the strongest covariates selected by the lenient cut-off point.
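The effect of the three cut-off points can be sketched in a few lines of Python. The VIP scores below are invented purely for illustration and are not the study's values. Treating the cut-offs as inclusive (score ≥ cut-off) is an assumption, and the sketch ignores the regression coefficients that the study considered alongside the VIP.

```python
# Cut-off points named in the text.
VIP_CUTOFFS = {"strict": 1.5, "moderate": 1.0, "lenient": 0.8}


def select_by_vip(vip_scores, cutoff):
    """Return covariate names with VIP >= cutoff, strongest first."""
    ranked = sorted(vip_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score >= cutoff]


# Hypothetical VIP scores, for illustration only.
example_vip = {
    "lymphocytes": 2.1,
    "basophils": 1.6,
    "sodium": 1.2,
    "folate": 0.9,
    "pulse": 0.3,
}
selected = {label: select_by_vip(example_vip, cutoff)
            for label, cutoff in VIP_CUTOFFS.items()}
for label, names in selected.items():
    print(f"{label} (VIP >= {VIP_CUTOFFS[label]}): {names}")
```

Because each lower cut-off admits every covariate that passed the higher one, the selections are nested: the strict set sits inside the moderate set, which sits inside the lenient set.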

Figure 2 lists all 40 covariates from the strongest to the weakest, together with their behavioural patterns in predictive power (coefficients), component construction (loadings) and independent association (correlation) with the CD4+ count and the associated p values. The covariate loadings and regression coefficients indicated more or less the same effects in component construction and predictive power, respectively. Among the significant covariates, folate, magnesium, calcium and sodium were associated with the largest decreases in the CD4+ count, whereas alkaline phosphatase (ALP), mean corpuscular volume (MCV) and lactate dehydrogenase (LDH) corresponded to an increased CD4+ count. In this study, the lymphocytes had the highest direct independent positive correlation with the CD4+ count (r = 0.5421, p < 0.0001), followed by haematocrit (r = 0.2337, p < 0.0001). On the other hand, protein had the strongest negative correlation (r = −0.1740, p < 0.0001) with the CD4+ count, followed by folate (r = −0.1530, p < 0.0001). The results showed that the top 8 of the 18 selected covariates were positively and independently associated with the CD4+ count. Of all the 40 non-redundant covariates investigated, red blood cell distribution width (RDW), pulse, urea, alanine aminotransferase (glutamate pyruvate transaminase; ALT/GPT) and axillary temperature were the least important.

Fig. 2

Variable importance. Also shown are the related loadings, standardised regression coefficients and correlations of each covariate with the response variable (CD4+ count).

A look at the significant variables by clinical category (Fig. 3) revealed that no significant variable was selected from lipids, physical examination or anthropometric measurements. Folate was the only significant variable in its category, as was alkaline phosphatase among the liver function indicators. The PLS suggested chloride and RDW as the only insignificant CD4+ count covariates among the electrolytes and red blood cells, respectively. Of the significant covariates within the white blood cells group (lymphocytes, basophils and monocytes), the lymphocytes were dominant. Generally, most of the significant CD4+ count covariates were selected from electrolytes, proteins and red blood cells. The data for all the variable selection plots are given in File S1.

Fig. 3

Variable importance by clinical category. The broken red horizontal lines divide the major groups. From the top, the groups are clinical examination, blood chemistry, sugar, lipids and full blood count. The horizontal broken grey lines divide the subgroups within the major groups


In the present study, we evaluated a list of continuous CD4+ count clinical covariates that were available at CAPRISA to determine the strongest candidates that can potentially become an integral part of the HIV treatment process. HIV targets and kills CD4+ cells, which makes the CD4+ count an important indicator of the patient’s health status. ART is known to suppress the viral load; consequently, more CD4+ cells are spared, giving rise to an improved immune system [73]. Hence, during the HIV treatment phase, ART is a major determinant of the CD4+ count distribution. The intention of this study was to select the continuous clinical covariates that contributed the greatest variation in the CD4+ count from an overall perspective throughout the post-infection period, including ART. We used the PLS approach to achieve this because it makes variable reduction possible [82] given the long list of covariates under study. The PLS also handles the variation in the multilevel structure of the data. The evaluated covariates were already known to be associated with the CD4+ count on the basis of other statistical methods that were limited in some way or suffered from information loss due to grouping, as detailed in the introduction. The predictive nature of the selected continuous covariates was beyond the scope of this work; our focus was on variable selection, paving the way for areas such as predictive modelling with streamlined and richer continuous CD4+ count clinical covariates.
In this discussion, we provide a brief summary of the functions of the 18 strongest covariates (out of 46) selected by our PLS model, to point out the direction for future studies on the feasibility of incorporating them in the HIV treatment process to influence long-term CD4+ cell response. The particular aim is to prolong the pre-treatment period and hence delay patients’ exposure to ART side effects, although the covariates can still be influential in the long-term CD4+ cell response during therapy, as previously reported [33, 34]. On our list of selected continuous clinical covariates, the lymphocytes were the strongest, as expected, because the CD4+ cells are a T cell type [83] whereas the lymphocytes are either B or T cells [4, 56, 84]. Our results also showed the lymphocytes to have the highest independent positive correlation with the CD4+ count (r = 0.5421, p < 0.0001). Hence, efforts to improve the CD4+ cell response seem to be similar to those for the lymphocytes, and this result gives some assurance of the effectiveness of our statistical methodology. Among the other selected variables, our results showed the need to pay close attention to the white blood cells (basophils and monocytes) and platelet count. Basophils control damage to body tissues and inflammation, and monocytes fight pathogens [84]. Platelet count measures the blood clotting condition [84,85,86,87]. Although basophils are the least abundant leucocytes [88], our study found them to explain the greatest variation in the CD4+ count after the lymphocytes. However, direct contact between human basophils and CD4+ T cells is known to mediate viral trans-infection of T cells through the formation of viral synapses [89, 90]. Also, the presence of basophils and other white blood cells in the blood is affected by underlying infection [91].
Areas of potential consideration in the blood chemistry group included potassium, sodium, calcium, magnesium, ALP and folate. Potassium regulates the acid-base chemistry and water balance [92], nerve impulses and heart muscle [84, 85]. Its effect on the CD4+ count is affected by underlying comorbidities [93]. Sodium and calcium regulate the water balance, blood pressure, blood volume, heart rhythm and, most importantly, brain and nerve function [84, 85, 92]. Changes in the sodium concentration are known to create an osmotic gradient between the extra- and intracellular fluid [94], suggesting that a proper balance is essential. Magnesium is involved in muscle contractions and protein processing [84], ALP is a marker of liver health [1, 95, 96] and folate is needed for cell growth and metabolism [97, 98]. Red blood cell indices [haematocrit, MCV, mean corpuscular haemoglobin concentration (MCHC) and red blood cells] are related to haemoglobin [99], which binds oxygen for transport to tissues and binds tissue carbon dioxide to transport it back for exhalation [100, 101]. The indices indicate the volume, concentration and proportions of red blood cells [101, 102]. Because volume contributes to the haematocrit, dehydration is a potential confounder of its relationship with the CD4+ count. Details on patient dehydration were not available, so it was not taken into consideration in this study. In line with the red blood cell indices, our results revealed that LDH also needs attention. LDH is a cytosolic enzyme that enables cells to meet short-term energy requirements in the absence of sufficient oxygen, at the expense of greater glucose consumption [103]. Proteins (total protein, albumin and LDH) were included in the selected list; they maintain normal water distribution between the tissues and blood as well as the acid-base balance [104].

It is important to acknowledge that this study had some limitations. Several variables that influence the clinical covariates may not have been included, for example, dehydration, underlying infection, comorbidities and patient dietary conditions, which particularly affect the biochemistry covariates. These are potentially important confounders that could have been adjusted for. Furthermore, the study findings are limited to adult females. We recommend that future studies consider the effect of gender and age on the strongest CD4+ count covariates during HIV disease progression. Given a large enough sample size, evaluating the clinical covariates for subjects with a CD4+ count < 250 cells/mm³ is also recommended, because this threshold is a key driver for prophylaxis and surveillance of opportunistic infections.


Only a few of the many clinical attributes routinely collected during HIV disease progression were found to be strong CD4+ count covariates, and these came mostly from electrolytes, proteins and red blood cells. Prolonging the pre-treatment period of HIV disease progression by effectively incorporating and managing the covariates for long-term influence on the CD4+ cell response has the potential to delay the challenges associated with ART side effects. Basophils, which indicate damage to body tissues and inflammation, were found to be the strongest CD4+ count covariate after the lymphocytes to incorporate and manage for long-term influence on the CD4+ cell response. Lipids, physical examination and anthropometric measurements are not worth considering as important drivers of the CD4+ count when monitoring the health status of HIV-infected women during disease progression. Resources could be optimised by streamlining the routinely collected information to just the few clinical attributes that strongly co-vary with the CD4+ count when monitoring the health status of HIV-infected patients during disease progression.