The HIV/AIDS epidemic remains a global public health problem. An estimated 37 million people are currently living with HIV (PLHIV), 70% of them in sub-Saharan Africa [1]. HIV prevalence is usually estimated from nationally representative, population-based studies such as demographic health surveys (DHS). However, surveys often suffer from missing data, which can be a source of bias and can reduce study precision [2].

Accurate HIV prevalence estimates are important for monitoring and evaluating ongoing programmes for the prevention and treatment of HIV, and for allocating resources within countries [3]. The available literature and the guidelines on reporting observational studies (STROBE) suggest that, for results to be properly interpreted, the amount of missing data and the methods used to handle it must be reported [4, 5]. The STROBE guidelines go further and explain the importance of reporting the reasons for missingness. These may include unit non-response, where a study participant or household is missing from the entire study, or item non-response, where some questions are not answered or are wrongly entered in the database. Common reasons for missing data in HIV studies include refusal to test and non-response to the survey [3, 6]. However, few studies report the proportion of missing data, and even fewer describe the methods used to adjust for it [7].

Most published articles estimating the prevalence or incidence of disease rely only on complete case analysis or available case analysis [8]. A few describe ad hoc methods, such as dummy variables and mean imputation. Even fewer describe more advanced methods for adjusting for missing data, such as inverse probability weighting, instrumental variables and multiple imputation [7, 9].

Many demographic and cross-sectional surveys have been conducted to estimate HIV prevalence and have been reported in peer-reviewed journals, but few recognise the bias that missing data can introduce. Editors and authors need to consider how these estimates were obtained and how missing data were addressed. It is important that advanced methods to adjust for missing data are incorporated into the analysis of HIV survey data to reduce bias in the estimates. Failure to adjust for missing data may result in biased estimates of the parameters of interest and can have a negative impact on controlling the epidemic [9].

This study aimed to review articles from HIV surveys with missing data to determine which analytical methods or techniques have been used when estimating HIV prevalence. It also aimed to identify the methods used for sensitivity analysis to assess the robustness of the assumptions made.


Two guidelines were followed in conducting and reporting this review: the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) [10] and Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) [5].

Eligibility criteria and search strategy

An information specialist searched five databases on 13th August 2018: Medline via PubMed, Web of Science Core Collection, Latin American and Caribbean Sciences Literature, Africa-Wide Information and Scopus (Additional file 1).

Studies were eligible if they were based on population surveys, either demographic or cross-sectional, were published between January 2000 and August 2018, estimated the prevalence of HIV/AIDS and were written in English. All articles had to include, in the abstract, a statement or paragraph on how missing data or non-response was handled during analysis.

Study selection procedure

All potential studies were imported into Covidence and screened by title and abstract to identify relevant studies (Covidence systematic review software, Veritas Health Innovation, Melbourne, Australia). Two independent reviewers applied the pre-specified criteria to select relevant abstracts and reject those that were not relevant, with a third reviewer acting as a tiebreaker. The full texts of all selected abstracts were obtained and assessed against the eligibility criteria. Disagreements were resolved through discussion between the two reviewers and the third reviewer.

Data extraction and risk of bias assessment

Before data extraction, all studies were assessed for risk of bias using a tool adapted from Hoy et al. (2012) [7, 11]. The Hoy tool was designed to assess the risk of bias in population-based prevalence studies; it comprises 10 domains that allow each included study to be classified as having a low or high risk of bias. The items include questions that assessed internal validity (representativeness of the national or target population, the sampling strategy used and the likelihood of non-response) and questions that assessed external validity (how data were collected and analysed, and the reliability and validity of the estimates) (Additional file 2). We used the kappa statistic to assess agreement between the two reviewers on the full-text studies included. Kappa values were interpreted as follows: 0 to 0.20, slight agreement; 0.21 to 0.40, fair agreement; 0.41 to 0.60, moderate agreement; 0.61 to 0.80, substantial agreement; and greater than 0.80, almost perfect agreement.
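The agreement statistic described above can be illustrated with a minimal sketch. It assumes two reviewers who each label every study as "include" or "exclude"; the function names and the example decisions are hypothetical and are not the data from this review. Values below 0 fall outside the bands listed above, so the sketch labels them separately.

```python
def cohen_kappa(ratings_a, ratings_b):
    """Return Cohen's kappa for two raters who labelled the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(ratings_a)
    labels = set(ratings_a) | set(ratings_b)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of each rater's marginal label proportions.
    expected = sum(
        (ratings_a.count(label) / n) * (ratings_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_band(kappa):
    """Map a kappa value to the agreement bands listed in the text."""
    if kappa < 0:
        return "below chance agreement"  # not covered by the bands in the text
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical screening decisions for 50 studies (illustration only).
reviewer_1 = ["include"] * 20 + ["exclude"] * 30
reviewer_2 = ["include"] * 18 + ["exclude"] * 32

kappa = cohen_kappa(reviewer_1, reviewer_2)
print(round(kappa, 3), agreement_band(kappa))
```

In this made-up example the reviewers disagree on two of 50 studies, so observed agreement is 0.96 while chance agreement is 0.528, which is why kappa is lower than the raw agreement.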

Two reviewers independently used a piloted data extraction form with structured questions to collect data from the included studies. We collected data on year of publication, place of study, type of study, sample size, whether the analysis adjusted for missing data, how the outcome of interest was analysed, the primary analysis and the methods used to adjust for missing values. Discrepancies were discussed and resolved; an external reviewer was brought in if the two reviewers could not reach consensus. The data extraction tool is included as Additional file 3.

Data analysis

The extracted data were analysed quantitatively. All collected variables were described and summarised using a flow chart and tables, and the characteristics of the individual included studies were described. The proportion of studies that reported missing values and the methods used to adjust for missing data or selection bias were summarised. The methods used for analysis were also described, and the studies that performed sensitivity analyses for any of the methods were counted.


A total of 3426 citations were identified, 194 duplicates removed, 3232 screened, and 69 full articles obtained. The excluded abstracts were not surveys, or were not estimating HIV prevalence, or did not include any missing data methods to estimate HIV. Following full-text eligibility assessment, 24 studies were included while 45 studies were excluded due to not being a survey [12], not measuring HIV prevalence [13], being a methodological study [8], having no missing data methods used during analysis [3], duplicates [3] and 1 study where we could not assess the risk of bias, as it did not show the adjusted HIV prevalence after using the advanced methods for missing data. Table 1 shows the details of the excluded studies and a flow chart of the systematic review is provided in Fig. 1.

Table 1 Excluded studies and reasons for exclusion
Fig. 1 A PRISMA flow diagram of the search and study selection process

Description of included studies

Of the 24 studies, 12 (50%) were Demographic Health Survey (DHS) studies [48,49,50,51,52,53,54,55,56,57,58,59,60], seven (29%) were cross-sectional surveys [52, 61,62,63,64,65,66], three (13%) were population surveys [67,68,69] and two (8%) used a mixture of Demographic Health Survey and AIDS Indicator Survey data [50, 70]. These studies were published between 2006 and 2018, and more than 95% were conducted in sub-Saharan Africa. Participants were aged 12 to 64 years, and more women than men took part. Table 2 summarises the 10 included studies that used a single, unique source of data and did not use DHS data.

Table 2 Description of included studies which used only one source of data

Fourteen studies analysed multiple sources of data, and some datasets were used in more than one study. All of these studies used DHS data from different countries in sub-Saharan Africa. The most commonly used datasets were the Zambia DHS (2007) and the Zimbabwe DHS (2006). The study by Marino et al. used more datasets than any other (28/32), followed by Hogan et al. (27/32) and Mirsha et al. (14/32). Table 3 shows the overlap in data usage among the 14 studies with multiple sources of data, including DHS data.

Table 3 Display of multiple datasets usage

Risk of bias assessment

The overall Cohen’s kappa coefficient for the two authors screening all the included studies was 0.93. The risk of bias was higher in the domains that assessed internal validity than in those that assessed external validity. Almost all studies had a high risk of bias on Domain 4, which addressed the likelihood of non-response (23/24), followed by Domain 1, which addressed whether the target population was a close representation of the national population (10/24) (Additional file 4). Only one study had a high risk of bias on a domain addressing external validity (Domain 8), which asked whether the same mode of data collection was used for all subjects. Additional files 2 and 4 show in detail all the domains assessed and the results of the assessment.

Characteristics of the missing data

Only 21 of the 24 studies reported the response rate for HIV testing, which ranged from 32 to 96%. All studies gave a reason for the missing data reported; the main reason was that participants refused to consent to an HIV test, and 8 (33%) studies identified further missing data from unit non-response. Six (25%) studies reported missing data as a separate outcome, while only 9 (38%) had a results table comparing participants with complete data and those with missing data. Table 4 provides a summary of these characteristics.

Table 4 Summary of the missing data characteristics (n = 24)

Analytical methods used in the analysis

All 24 included studies used complete case analysis as their primary method of analysis. Multiple imputation was the most frequently used advanced method to adjust for missing data (11 studies, 46%), followed by Heckman’s selection model (9 studies, 38%). Single imputation and instrumental variable methods were each used in only two studies, and 13 studies (54%) used other methods. Ten studies (42%) applied more than two methods in the analysis, with a maximum of four methods in two studies. Table 5 describes the methods used to adjust for missing data when estimating HIV prevalence.

Table 5 Missing data methods used in the analysis

Only one study mentioned the pattern of missing data identified, while just over half of the studies, 13 (54%), stated the mechanism assumed in the analysis. Of these 13 studies, all assumed data to be missing completely at random (MCAR) for the complete case analysis, 11 assumed data to be missing not at random (MNAR), ten assumed data to be missing at random (MAR) and seven assumed both MAR and MNAR. Among the studies that used multiple imputation, only 3 (27%) stated the number of imputed datasets used in the analysis, but seven (64%) mentioned the variables used in the imputation model. To assess the robustness of the results, only 6 (25%) studies conducted a sensitivity analysis, while 11 (46%) studies showed a significant change in estimates after adjusting for missing data. Table 6 provides details on the different aspects of the analysis strategy and methods.

Table 6 Further information on the analysis and results conclusion provided


Of the 69 articles assessed in full text on this HIV topic, only 24 addressed the missing data problem when estimating HIV prevalence. The same pattern of few studies addressing missing data has been observed in other designs, such as clinical trials and HIV longitudinal studies measuring different outcomes [72]. The main reported reason for missingness was refusal to consent to an HIV test, and complete case analysis was the primary method of analysis. Multiple imputation and Heckman’s selection models were the main methods used to adjust for missing data, with 46% of studies showing a significant change in estimates after adjustment. Only a quarter of the included studies conducted a sensitivity analysis to assess the robustness of the results.

Agreement between the reviewers on the risk of bias was good. Across the included studies, the risk of bias was higher in the domains assessing internal validity, in particular the likelihood of non-participation, than in the domains assessing external validity. This may be because one inclusion criterion for the review was that the study should contain a statement addressing missing data or non-response.

The STROBE guidelines [5] recommend that authors report the amount of missing data, the methods used to handle it and the reasons for missingness. Of the included studies, only one was published before the STROBE guidelines appeared in 2007. Most of the included studies reported the amount of missing data and the corresponding reasons for missingness. However, very few explored the differences between participants with complete data and those with missing data, which can serve as a basis for examining the MCAR assumption.

The included studies used a range of methods for missing data, from ad hoc approaches (complete case analysis and single imputation) to advanced methods assuming an MAR or MNAR mechanism (e.g., multiple imputation). Multiple imputation was the most common method, although most studies did not clearly explain the methodology behind it, such as the imputation algorithm, the number of imputed datasets and the details of the imputation model. Providing this information supports replication of the methods and assessment of the results.
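One reason the number of imputed datasets matters is that it enters the pooled variance directly. The sketch below applies Rubin's rules, the standard way of combining estimates across imputed datasets, to hypothetical prevalence estimates; the function name and all numbers are assumptions for illustration, and the included studies did not necessarily report how they pooled their results.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool per-imputation estimates and variances using Rubin's rules.

    Returns the pooled estimate and its total variance.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled_estimate = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total


# Hypothetical HIV prevalence estimates from m = 5 imputed datasets.
prevalence_by_imputation = [0.150, 0.158, 0.162, 0.155, 0.160]
variance_by_imputation = [0.0001] * 5

pooled, total_variance = pool_rubin(prevalence_by_imputation, variance_by_imputation)
print(round(pooled, 4), round(total_variance, 7))
```

The factor (1 + 1/m) shrinks towards 1 as the number of imputations m grows, so a reader cannot judge the reported precision without knowing m.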

We observed an increase in HIV prevalence estimates after adjusting for missing data, which indicates downward bias when complete case analysis is used. The differences were significant in some studies [58, 71], suggesting that HIV prevalence may be underestimated if missing data are ignored.
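A small worked example may help show how such downward bias can arise. The sketch assumes two hypothetical strata in which the higher-prevalence stratum also has the lower testing uptake, and it assumes that, within a stratum, people who were tested resemble those who were not (an MAR assumption). The counts and function names are invented for illustration; if data are MNAR, this kind of weighting would not remove the bias.

```python
def complete_case_prevalence(strata):
    """Prevalence among tested participants only, ignoring non-response."""
    positives = sum(s["positive"] for s in strata)
    tested = sum(s["tested"] for s in strata)
    return positives / tested


def ipw_prevalence(strata):
    """Prevalence with each tested person weighted by 1 / P(tested | stratum)."""
    weighted_positives = 0.0
    weighted_total = 0.0
    for s in strata:
        # Estimated response probability is tested / eligible, so the weight is its inverse.
        weight = s["eligible"] / s["tested"]
        weighted_positives += weight * s["positive"]
        weighted_total += weight * s["tested"]
    return weighted_positives / weighted_total


# Hypothetical survey: stratum B has higher prevalence and lower test uptake.
survey = [
    {"name": "A", "eligible": 1000, "tested": 900, "positive": 45},
    {"name": "B", "eligible": 1000, "tested": 500, "positive": 100},
]

cc = complete_case_prevalence(survey)
ipw = ipw_prevalence(survey)
print(f"complete case: {cc:.4f}, weighted: {ipw:.4f}")
```

Here the complete case estimate is 145/1400, about 10.4%, while the weighted estimate is 250/2000, or 12.5%, because the weights restore the under-tested stratum to its share of the eligible population.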

All the applied methods had shortcomings related to the assumed missingness mechanism, since there is no way to prove that missing data were MAR or MNAR. Heckman’s selection models and instrumental variables were the methods used to explore departures from MAR towards MNAR, although the lack of a suitable selection or instrumental variable limits their applicability [57, 71]. Doubly robust methods and extensions of Heckman’s selection models are currently identified as suitable when data are assumed to be MNAR. Given that missing data in HIV prevalence studies may not be MAR and may be MNAR [54, 68], it is important to explore more methods than those identified in this review.

In addition, a report from the National Research Council (NRC) [73] explains the importance of conducting sensitivity analysis to assess how robust the results and conclusions are to the assumptions made when applying methods to adjust for missing data. However, only a quarter of the included studies performed a sensitivity analysis. This is consistent with other reviews on missing data, which found that very few studies assessed the robustness of their results, regardless of design [74, 75].
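As an illustration of what a simple sensitivity analysis can look like, the sketch below varies an untestable assumption: the ratio of HIV prevalence among people who were not tested to prevalence among those who were. This is only one possible approach, not necessarily the one used by the included studies or described in the NRC report, and the function name, parameter names and numbers are all hypothetical.

```python
def adjusted_prevalence(observed_prevalence, response_rate, ratio):
    """Overall prevalence if the untested have `ratio` times the observed prevalence."""
    if not 0 <= observed_prevalence <= 1:
        raise ValueError("observed_prevalence must be between 0 and 1")
    if not 0 < response_rate <= 1:
        raise ValueError("response_rate must be in (0, 1]")
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    # Prevalence among the untested is capped at 1 because it is a proportion.
    untested_prevalence = min(1.0, ratio * observed_prevalence)
    return (response_rate * observed_prevalence
            + (1 - response_rate) * untested_prevalence)


# Hypothetical survey: 80% of eligible people tested, 10% observed prevalence.
observed = 0.10
tested_share = 0.80
scenarios = {
    ratio: adjusted_prevalence(observed, tested_share, ratio)
    for ratio in (1.0, 1.5, 2.0)
}

for ratio, prevalence in scenarios.items():
    print(f"ratio {ratio:.1f}: adjusted prevalence {prevalence:.3f}")
```

A ratio of 1.0 reproduces the complete case estimate; reporting how far the estimate moves across a plausible range of ratios shows readers how much the conclusion depends on the assumed mechanism.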

This is the first systematic review exploring the methods used to address missing data when estimating HIV prevalence. However, the results can only be generalised to studies where the focus is on missing data. This review will guide our future application of these methods to real datasets from a population-based study conducted in North-West Tanzania, where we will estimate the amount of bias caused by missing data. We will also extend the methods to the assumption that data are MNAR, with further assessment through sensitivity analysis.


This review examined surveys to determine which analytical methods or techniques have been used to address missing data when estimating HIV prevalence. The included studies showed that several methods can be used when data are not missing completely at random. However, studies often report very little information on the steps, theories, assumptions and sensitivity of the reported results.

All methods used for handling missing data in the included studies produced estimates that differed from the primary analysis, and in some studies the difference was large. These differences highlight the need to consider more advanced methods when facing missing data in surveys and population studies, to avoid producing biased results.

Further work is needed to compare the performance of the available methods for dealing with missing data and the amount of bias that remains after they are applied. Awareness is important for ensuring that these methods are applied appropriately and that the right choices are made given the reasons for, patterns of and mechanism behind the missing data.