Background

Substance use, including illicit drug and alcohol use, is prevalent worldwide, with about 5% of adults using illicit substances [1] and 40% of adults consuming alcohol in the past year [2]. Moreover, the number of people with drug use disorders was estimated at 62 million, while the number of individuals with alcohol use disorders was estimated at 100.4 million in 2016 [3]. Substance use disorders are associated with substantial morbidity and mortality globally. Illicit drug use disorders accounted for 20 million disability-adjusted life years (DALYs) lost [4], while alcohol use disorders accounted for 85 million DALYs lost in 2012 [5]. Use of specific classes of substances also plays an important role in HIV risk behaviors, including needle sharing and sexual risk behaviors, and has been linked to HIV incidence [6,7,8,9,10,11,12,13,14,15]. Among people living with HIV (PLWH), substance use disorders may lead to less optimal HIV care outcomes because they are associated with a lower likelihood of being linked to HIV care, being retained in care, receiving antiretroviral therapy (ART), maintaining high ART adherence, and having an undetectable HIV viral load [9, 10, 16,17,18].

Given the role of substance use in the global burden of disease and the overlap between use of specific substances and HIV, it is important for clinicians and researchers to have tools with high reliability, validity, and diagnostic accuracy [19]. Yet too few use measures with known psychometric properties when assessing substance use. Currently, a myriad of standardized questionnaires are used to screen for substance use and misuse; these require patients to self-report patterns of use and substance-related problems. Examples such as the Alcohol Use Disorders Identification Test and the Drug Use Disorders Identification Test [20, 21] provide scores that correspond with severity of substance use and related problems. There are still no biological measures that define a substance use disorder; existing biological measures are considered to be indirect correlates of use disorders [22]. Examples include alcohol biomarkers such as carbohydrate-deficient transferrin (CDT) and gamma-glutamyl transferase (GGT), which are used to screen for alcohol dependence and heavy drinking, respectively [22]. There is a great need to evaluate the psychometric performance of these measures and markers across studies in settings of HIV to elucidate their overall validity, reliability, and diagnostic accuracy.

One approach to informing the use of psychometric measures in research and clinical care is to pool the psychometric characteristics of measures across studies using meta-analytic techniques, which generate summary estimates of the validity, reliability, and diagnostic accuracy of different questionnaires [23,24,25,26,27]. However, synthesis of psychometric properties of substance use measures to identify patterns of use and substance use disorders remains limited, with few exceptions [21, 28, 29]. One meta-analysis focused on the accuracy of self-reported assessments to diagnose alcohol and cannabis use disorders found that instruments had a pooled sensitivity of 0.88 and a pooled specificity of 0.90 among pediatric emergency department patients [28]. Another meta-analysis observed that single-question measures used to identify alcohol use disorders in primary care had a pooled sensitivity of 0.54 and a pooled specificity of 0.87, while two-question measures had a pooled sensitivity of 0.87 and a pooled specificity of 0.80 [29]. More commonly, however, reviews on substance use measures present psychometric data in a descriptive fashion [19, 30, 31]. Therefore, more rigorous efforts to systematically pool the psychometric properties of substance use measures are needed to establish the overall performance and accuracy of these tools and point toward their utility in future research.

To address these gaps, we conducted a systematic review and meta-analysis of the literature to identify studies that have reported the validity and reliability of substance use measures, and pooled these measures using meta-analytic techniques. For the purposes of this review, we targeted our search for measures of substance classes previously associated with HIV risk. Specifically, we focused our review on measures for the following: alcohol, methamphetamine and amphetamine, cocaine, heroin, and ecstasy, regardless of whether the study was conducted among a population at high risk for HIV. Additionally, we included measures that evaluated substance use in general (i.e., measures that did not differentiate between classes of substances) as long as those measures were inclusive of our targeted substance classes. This study’s review question is: What are the summary reliability and validity (as measured by alpha and kappa coefficients) and diagnostic accuracy (as measured by sensitivity, specificity, positive predictive value, and negative predictive value) of various substance and alcohol measures used to screen for use and use disorders?

Methods

Search strategy

We conducted a systematic review of studies published prior to June 2016 on substance use measures indexed in electronic databases including PubMed, PsycINFO, and EMBASE. We developed Boolean search terms to capture substance use measures that have been previously associated with HIV risk, in consultation with a reference librarian from the University of California San Francisco with a master’s degree in library and information science (MLIS). The following substance classes were included: alcohol, methamphetamine and amphetamine, cocaine, heroin, and 3,4-methylenedioxy-methamphetamine (MDMA; “ecstasy”). Because the focus of this study was to pool psychometric properties of measures, we also included search terms related to validity, reliability, and diagnostic accuracy (i.e., alpha, kappa, sensitivity, specificity, positive predictive value, negative predictive value). Search terms included MeSH headings related to our research question, general terms related to substance use and psychometric properties of interest, as well as specific terms referencing the names of well-known substance use measures. The search terms used are provided in the appendix. This review was registered in PROSPERO, the International Prospective Register of Systematic Reviews (study number: CRD42017058813).

Primary outcomes

We aimed to estimate the pooled summary estimates for the following psychometric outcomes: Cronbach’s alpha, kappa, sensitivity, specificity, positive predictive value, and negative predictive value. We recognize that there are a number of measure characteristics that relate to validity [32]. However, to focus our review and facilitate the feasibility of completing this study, we have decided to restrict the scope of our validity measures to Cronbach’s alpha. Descriptions for these outcomes are provided below:

Psychometric outcome: Description

Cronbach’s alpha: measure of internal consistency, that is, how closely correlated a set of scale items are as a group.

Kappa: measure of inter-rater agreement or inter-rater reliability for qualitative (categorical) items, which takes into account the possibility of the agreement occurring by chance.

Sensitivity: measure of a test’s or scale’s ability to correctly detect patients who truly have the condition (i.e., proportion of people who screen positive for substance use disorders according to the scale, among those who truly have substance use disorders based on an established standard (“gold standard”) such as meeting diagnostic criteria for a disorder).

Specificity: measure of a test’s or scale’s ability to correctly identify patients without a condition (i.e., proportion of people who screen negative for substance use disorders according to the scale, among those who truly do not have substance use disorders based on an established standard such as meeting diagnostic criteria for a disorder).

Positive predictive value (PPV): the probability that a person with a positive screening result actually has the disorder (i.e., proportion of people who meet diagnostic criteria for a substance use disorder among those who screened positive for the disorder on a scale).

Negative predictive value (NPV): the probability that a person with a negative screening result actually does not have the disorder (i.e., proportion of people who do not meet diagnostic criteria for a substance use disorder among those who screened negative for a substance use disorder on a scale).
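
As an illustration of the six outcomes defined above, the following Python sketch computes each one from raw data. It is not code from this review: the function names, argument names, and the example counts are hypothetical, and the formulas are the standard textbook definitions (Cohen's kappa for two raters with a binary rating, Cronbach's alpha with sample variances).

```python
from statistics import variance


def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV from a 2x2 table that
    cross-classifies the screening result against the gold standard."""
    return {
        "sensitivity": tp / (tp + fn),  # screen positive among true cases
        "specificity": tn / (tn + fp),  # screen negative among true non-cases
        "ppv": tp / (tp + fp),          # true cases among screen positives
        "npv": tn / (tn + fn),          # true non-cases among screen negatives
    }


def cohen_kappa(both_pos, only_a, only_b, both_neg):
    """Chance-corrected agreement between raters A and B on a binary rating."""
    n = both_pos + only_a + only_b + both_neg
    observed = (both_pos + both_neg) / n
    # Agreement expected by chance, from each rater's marginal totals.
    expected = (
        (both_pos + only_a) * (both_pos + only_b)
        + (only_b + both_neg) * (only_a + both_neg)
    ) / n ** 2
    return (observed - expected) / (1 - expected)


def cronbach_alpha(item_scores):
    """Internal consistency; item_scores is a list of respondents,
    each a list of k item scores (k >= 2)."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical counts: 90 true positives, 40 false positives,
# 10 false negatives, 160 true negatives.
metrics = diagnostic_accuracy(tp=90, fp=40, fn=10, tn=160)
kappa = cohen_kappa(90, 40, 10, 160)
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
```

In this hypothetical table, sensitivity and specificity are both 0.80 or higher while the PPV is lower, because the PPV also depends on how common the disorder is in the sample.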

Eligibility criteria

We searched for relevant publications that met all of the following inclusion criteria: 1) studies that reported one or more of the psychometric outcomes of interest; 2) studies that examined one or more substance use measures related to our substance classes of interest (i.e., alcohol, methamphetamine and amphetamine, cocaine, heroin, and ecstasy) or for substance use in general (i.e., some measures do not differentiate between multiple substances or assess classes of substances all together); 3) publication written in English (note: studies that administered measures that were not in English were eligible as long as the publication was written in English).

We excluded publications using the following exclusion criteria: 1) reporting insufficient information on reliability, validity and diagnostic accuracy for substance use measures/assessments (i.e., no numeric information on our psychometric outcomes, sample size); 2) articles that provide psychometric data for a measure/assessment that is not related to substance use (e.g., a study on internal consistency data on a depression scale among substance users); 3) articles and/or secondary data analyses that report reliability and validity data from a primary outcome paper that was already included in the review; 4) reviews, commentaries, case report studies and other publications with insufficient reporting of data; 5) substance use measures/assessments that focus on aspects other than actual substance consumption, dependence or substance use disorder (e.g., a study reporting validity of a self-efficacy scale for resisting substance use; a study that examines the underlying mechanisms of substance use among those who already have a substance use disorder); and 6) studies with psychometric properties that focus on substance classes outside the scope of our review (e.g., marijuana or tobacco).

Screening procedures

All citations (including their titles and abstracts) captured by the search strategy were imported into Covidence.org (Melbourne, Victoria), which allowed research team members to independently review and screen citations using a centralized, online database. Each title/abstract was screened by two members of a team comprising master-, doctoral-, and post-doctoral-level researchers trained in the study protocol (co-authors PP, DH, RC, DS, CM, PM, and FC), and citations that were coded as eligible by both reviewers were moved to the full-text review phase. The same process was then repeated for full-text articles. In the event of discrepancies between reviewers in either the title and abstract phase or the full-text phase, a third team member (GMS) reviewed the relevant documents and helped reconcile the differences. Articles that were deemed eligible in the full-text review stage were included in the data extraction phase described below.

Data extraction

Team members extracted data on the psychometric properties, scale and study characteristics, sample size, study sample characteristics/co-factors of interest (country where the study was conducted, number of sites, language in which the scale was administered, gender of participants included), cut-offs used, comparison measure/gold standard used, and other information relevant to the study, including information on study quality [33]. Some papers reported multiple data points for psychometric outcomes from different study populations (e.g., disaggregated data by sex or different research sites). These data points were extracted as separate records only if the paper did not provide a single overall measure for the psychometric outcomes for the entire study sample, consistent with other analyses [24].

Assessment of bias risk

For studies reporting diagnostic measures (e.g., sensitivity and specificity), reviewers rated study quality using the Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies, QUADAS-2, guidelines [33], which includes quality rating questions on the study’s patient selection, index test, reference standard, and flow and timing. For studies that did not include diagnostic accuracy measures, only relevant domains of QUADAS-2 were assessed, as appropriate (i.e., rating regarding the reference standard was not conducted). All extracted data were entered into an electronic questionnaire programmed in Qualtrics, and checked by another researcher (conducted by the same co-authors who screened citations, as well as co-author BK) to verify accuracy.

Data analyses

We calculated separate pooled summary estimates for each of the 37 substance use measures and fitted separate models for each of the six psychometric outcomes for validity, reliability, and accuracy. For alpha, kappa, PPV and NPV, we pooled data across studies using DerSimonian-Laird random effects models, implemented in STATA version 13 (College Station, TX) [34]. Random effects meta-analysis models, as opposed to fixed-effects models, are preferred for pooling data from diagnostic accuracy tests since heterogeneity is presumed to exist across these studies [35]. Random effects models, which are considered the default models used in meta-analyses for diagnostic accuracy tests, synthesize the psychometric outcomes from separate studies into a weighted average effect size (pooled summary estimate), using inverse variance weighting, based on sample size, while taking into account the extent of the variability of the effect sizes observed in separate studies [35]. Additionally, for sensitivity and specificity, we used hierarchical logistic regression models, implemented using the metandi command in STATA, to account for the correlation between the two measures (i.e., trade-off between sensitivity and specificity) [36,37,38]. Since metandi requires a minimum of four observations to conduct a meta-analysis, we pooled measures with fewer than four records for sensitivity and specificity outcomes using the random effects models described for other outcomes, and noted this alternate approach in the results, as appropriate.
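
To make the random effects pooling step concrete, the following is a minimal Python sketch of the standard DerSimonian-Laird estimator. It is not the STATA code used in this review. The function name `dersimonian_laird` and its inputs are hypothetical, and it assumes each study supplies a point estimate together with its within-study variance (which in practice shrinks as sample size grows).

```python
import math


def dersimonian_laird(estimates, variances):
    """Pool study-level estimates with DerSimonian-Laird random effects.

    estimates: per-study point estimates (e.g., alpha, kappa, PPV, NPV)
    variances: per-study within-study variances (assumed to be supplied)
    """
    k = len(estimates)
    # Fixed-effect (inverse variance) weights and pooled mean.
    w = [1.0 / v for v in variances]
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sw
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    # Method-of-moments estimate of between-study variance, floored at zero.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random effects weights add tau2 to each within-study variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "q": q,
    }


# Hypothetical example: two studies with equal precision but different estimates.
result = dersimonian_laird([0.70, 0.90], [0.001, 0.001])
```

When the studies disagree by more than their sampling error would predict, `tau2` is positive, the weights become more equal across studies, and the confidence interval widens relative to a fixed-effects analysis. The sketch does not cover the hierarchical models fitted with metandi for sensitivity and specificity.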

Classification and evaluation of pooled estimates

Qualitatively, pooled summary estimates for alpha and kappa were classified as “excellent” for estimates that were > 0.89, “good” for estimates that were between 0.85–0.89, “moderate” for estimates that were between 0.80–0.84, “fair” for estimates that were between 0.75–0.79, or “unsatisfactory” for estimates below 0.75, consistent with other studies [24, 39].

Pooled summary estimates for sensitivity, specificity, positive predictive value and negative predictive value were classified as “excellent” for estimates that were > 0.89, “good” for estimates that were between 0.8–0.89, “moderate” for estimates that were between 0.6–0.79, and “low” for estimates that were < 0.6 [24, 40].
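
The two classification schemes above can be written as simple threshold functions. The Python sketch below is illustrative only, with hypothetical function names. Because the bands are stated to two decimal places, it assumes that each band extends up to the lower bound of the next one (for example, a value of 0.845 would be treated as "moderate" for alpha and kappa).

```python
def classify_agreement(estimate):
    """Qualitative label for a pooled alpha or kappa estimate."""
    if estimate > 0.89:
        return "excellent"
    if estimate >= 0.85:
        return "good"
    if estimate >= 0.80:
        return "moderate"
    if estimate >= 0.75:
        return "fair"
    return "unsatisfactory"


def classify_accuracy(estimate):
    """Qualitative label for pooled sensitivity, specificity, PPV or NPV."""
    if estimate > 0.89:
        return "excellent"
    if estimate >= 0.80:
        return "good"
    if estimate >= 0.60:
        return "moderate"
    return "low"


labels = [classify_agreement(0.92), classify_agreement(0.82),
          classify_accuracy(0.86), classify_accuracy(0.50)]
```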

For each pooled psychometric summary estimate, we calculated the I2 statistic, which represents the percentage of total variation across studies that is attributable to heterogeneity rather than chance, to assess heterogeneity. We interpreted I2 values of 25%, 50%, and 75% as benchmarks for low, moderate, and high heterogeneity, respectively [41]. We did not use standard meta-analytic tests for publication bias, given the limitations of these tests for diagnostic test accuracy studies and the characteristics of our psychometric outcomes (e.g., truncated measures that cannot fall below zero) [42]. As indicated in the Cochrane Handbook for Systematic Reviews of Diagnostic Test Accuracy, using these tests is inappropriate because they will likely lead to a high false-positive rate for publication bias [35].
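
For readers unfamiliar with I2, the sketch below shows the usual calculation from Cochran's Q and the number of pooled data points, followed by a labeling function. The function names are hypothetical. The exact boundaries between bands are an assumption made for illustration: values up to 25% are labeled low, values of 75% or more are labeled high, and everything in between is labeled moderate.

```python
def i_squared(q, k):
    """I2 (as a percentage) from Cochran's Q and the number of data points k."""
    df = k - 1
    if q <= 0 or df <= 0:
        return 0.0
    # Negative values are truncated to zero by convention.
    return max(0.0, (q - df) / q) * 100.0


def classify_heterogeneity(i2_percent):
    """Assumed cut points: <= 25 low, >= 75 high, otherwise moderate."""
    if i2_percent <= 25:
        return "low"
    if i2_percent >= 75:
        return "high"
    return "moderate"


# Hypothetical example: Q = 20 from two data points.
i2 = i_squared(q=20.0, k=2)
label = classify_heterogeneity(i2)
```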

Results

Screening and study inclusion

Study screening and inclusion are summarized in Fig. 1. In brief, we identified 7555 references in the initial search, of which 208 were excluded as duplicates. In the title and abstract review phase, reviewers excluded 5854 studies that were deemed ineligible. Full-text reviews were conducted for 1493 articles that were deemed eligible from title and abstract review. Of the full-text reviewed articles, 1105 studies were excluded for not meeting eligibility criteria. The most common reasons for exclusion were: scales or measures that were outside the scope of the review (n = 386), lack of psychometric data on scales of interest (n = 140), lab or methods papers that were outside the scope of the review (n = 130), non-English language publications (n = 110), duplicate studies (n = 98), and psychometric outcomes that were outside the scope of the review (n = 79). In total, 387 unique studies were included in the data extraction phase, containing sufficient data on the outcomes for 37 scales (Table 1).

Fig. 1 Study Identification, Screening, Eligibility, and Inclusion for Meta-Analysis

Table 1 Substance use Measures/Scales identified in Systematic Review and Meta-analyzed

Study characteristics

Table 2 presents characteristics of the studies included in this meta-analysis. As mentioned, studies published in English were included in this review, regardless of the language in which the scales were administered. Among the 387 studies included, the most common language in which the scale/measure was administered was English (63%), followed by Spanish (9%), French (5%), Portuguese (3%), and Chinese (2%). A large proportion of studies were conducted in the United States (40%). The median sample size was 286 [Range = 9–50,049]. The vast majority of studies (83%; n = 323) included men and women. Additionally, 11% (n = 42) of the studies included samples comprised only of men, while 5% (n = 20) included samples comprised only of women. Most studies were published after 1999 (66%), with studies published between 2000 and 2009 accounting for 38% (n = 148) of the studies meta-analyzed, and studies published between 2010 and 2017 accounting for 28% (n = 110). Most studies involved a single study site (61%), while 39% were multi-site studies. Additionally, 72% of the studies involved convenience samples, 20% included random or probability-based samples, and 7% had other or unclear sampling strategies.

Table 2 Pooled Summary Estimates

Assessment of bias in study quality

The risk of bias in the four QUADAS-2 domains for each study included in this meta-analysis is presented in Supplementary Table 1. The distribution of the QUADAS-2 domains across all studies is summarized in Fig. 2. Of the studies included, 58% had a low risk of bias with respect to the patient population, 57% had a low risk of bias in the index test domain, 48% had a low risk of bias in the reference standard domain, and 72% had a low risk of bias for flow and timing. Overall, only 16% of studies had a low risk of bias across all four QUADAS-2 domains.

Fig. 2 Overall Summary of study quality ratings from the Revised Tool for the Quality Assessment of Diagnostic Accuracy Studies, QUADAS-2

Pooled summary estimates: overall findings

The pooled summary estimates of psychometric properties of substance use measures (which are described in Table 1) are quantitatively and qualitatively summarized in Tables 2 and 3, respectively. Overall, 65% of pooled estimates for alpha were in the range of fair-to-excellent; 44% of estimates for kappa were in the range of fair-to-excellent. In addition, 69, 97, 37 and 96% of pooled estimates for sensitivity, specificity, positive predictive value, and negative predictive value, respectively, were in the range of moderate-to-excellent (Fig. 3).

Fig. 3 Distribution of Pooled Summary Estimates of Psychometric Outcomes

Table 3 Qualitative Interpretation of Pooled Estimates

Self-reported measures for which all pooled estimates were fair/moderate or better include the following: Alcohol Dependence Scale; Addiction Severity Index (ASI); ASI subscale for Alcohol; ASSIST; the Composite International Diagnostic Interview, including the original version, as well as version 2.1 and version 3; Drug Abuse Screen Test - 10 item scale; Drug Use Disorders Identification Test; Problem Oriented Screening Instrument for Teenagers; Severity of Dependence scale; Timeline Followback; and Chemical Use, Abuse, and Dependence. Biomarkers for which all pooled estimates were fair/moderate or better include the following: Ethyl glucuronide; Phosphatidylethanol test; and the combined use of Carbohydrate deficient transferrin and Mean corpuscular volume. In general, we also observed high heterogeneity between studies for most pooled estimates.

Pooled summary estimates, by substance use measure

The pooled estimates and 95% confidence intervals for alpha, kappa, sensitivity, specificity, positive predictive value, and negative predictive value are shown in Table 2. Below we summarize the results of the pooled summary estimates alphabetically for each of the 37 substance use measures, grouping self-reported measures and biomarkers separately. The list of references for the studies meta-analyzed for each scale/measure is presented in Supplementary Table 2.

Self-reported measures

Alcohol dependence scale (ADS)

The pooled alpha estimate for ADS (3 data points) was good: 0.90 (95%CI = 0.80–0.99) and there was high heterogeneity between studies (I2 98.9%). The pooled sensitivity estimate for ADS (2 data points) was excellent: 0.95 (95%CI = 0.90–1.00) and there was low heterogeneity between studies (I2 0%). The pooled specificity estimate (2 data points) was moderate: 0.64 (95%CI = 0.52–0.77) and there was moderate heterogeneity between studies (I2 60.1%). There was insufficient data to calculate the pooled PPV and NPV estimates for ADS.

Addiction Severity Index (ASI)

The pooled alpha estimate for ASI (3 data points) was good: 0.84 (95%CI = 0.81–0.87) and there was moderate heterogeneity between studies (I2 38.5%). There was insufficient data to calculate pooled kappa, sensitivity, specificity, PPV, and NPV estimates.

Addiction severity index-alcohol (alcohol sub-scale; ASI-A)

The pooled alpha estimate (18 data points) was fair: 0.77 (95%CI = 0.73–0.81) and there was high heterogeneity between studies (I2 94.3%). The pooled sensitivity estimate for ASI-A (6 data points) was good: 0.83 (95%CI = 0.67–0.92) and there was high heterogeneity between studies (I2 87.6%). The pooled specificity estimate for ASI-A (6 data points) was moderate: 0.79 (95%CI = 0.67–0.88) and there was high heterogeneity between studies (I2 91.2%). There was insufficient data to calculate pooled kappa, PPV and NPV estimates for ASI-A.

Addiction severity index-drugs (drugs sub-scale; ASI-D)

The pooled alpha estimate for ASI-D (16 data points) was unsatisfactory: 0.68 (95%CI = 0.63–0.74) and there was high heterogeneity between studies (I2 95.6%). The pooled sensitivity estimate (5 data points) was good: 0.86 (95%CI = 0.83–0.89) and there was moderate heterogeneity between studies (I2 62.5%). The pooled specificity estimate (5 data points) was good: 0.85 (95%CI = 0.77–0.91) and there was high heterogeneity between studies (I2 86%). There was insufficient data to calculate the pooled kappa, PPV and NPV estimates.

The alcohol, smoking, and substance involvement screening test (ASSIST)

The pooled alpha estimate (7 data points) was good: 0.85 (95%CI = 0.80–0.91) and there was high heterogeneity between studies (I2 94%). The pooled sensitivity estimate (2 data points) was good: 0.83 (95%CI = 0.80–0.87) and there was low heterogeneity between studies (I2 0%). The pooled specificity estimate (2 data points) was moderate: 0.73 (95%CI = 0.57–0.88) and there was high heterogeneity between studies (I2 91%). There was insufficient data to calculate the pooled estimate for kappa, PPV, and NPV.

Alcohol use disorders identification test (AUDIT)

The pooled alpha estimate for AUDIT (80 data points) was moderate: 0.85 (95%CI = 0.83–0.87) and there was high heterogeneity between studies (I2 98%). The pooled kappa estimate for AUDIT (4 data points) was unsatisfactory: 0.46 (95%CI = 0.25–0.67) and there was high heterogeneity between studies (I2 99%). The pooled sensitivity estimate for AUDIT (135 data points) was good: 0.86 (95%CI = 0.84–0.88) and there was high heterogeneity between studies (I2 97%). The pooled specificity estimate for AUDIT (135 data points) was good: 0.87 (95%CI = 0.85–0.89) and there was high heterogeneity between studies (I2 99%). The pooled PPV estimate for AUDIT (65 data points) was moderate: 0.61 (95%CI = 0.51–0.71) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for AUDIT (54 data points) was excellent: 0.94 (95%CI = 0.93–0.95) and there was high heterogeneity between studies (I2 96%).

Alcohol use disorders identification Test-3 (AUDIT-3)

Alpha cannot be calculated for AUDIT-3 because it is a single-item measure. There was insufficient data to calculate the pooled estimate for kappa. The pooled sensitivity estimate for AUDIT-3 (22 data points) was good: 0.84 (95%CI = 0.80–0.88) and there was high heterogeneity between studies (I2 90%). The pooled specificity estimate for AUDIT-3 (22 data points) was good: 0.84 (95%CI = 0.75–0.90) and there was high heterogeneity between studies (I2 99%). The pooled PPV estimate for AUDIT-3 (9 data points) was moderate: 0.63 (95%CI = 0.49–0.77) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate (7 data points) was excellent: 0.94 (95%CI = 0.90–0.98) and there was high heterogeneity between studies (I2 95%).

Alcohol use disorders identification test-C (AUDIT-C)

The pooled alpha estimate for AUDIT-C (20 data points) was fair: 0.75 (95%CI = 0.70–0.80) and there was high heterogeneity between studies (I2 99%). The pooled kappa estimate for AUDIT-C (2 data points) was unsatisfactory: 0.41 (95%CI = 0.39–0.43) and there was low heterogeneity between studies (I2 0%). The pooled sensitivity estimate for AUDIT-C (45 data points) was good: 0.87 (95%CI = 0.84–0.90) and there was high heterogeneity between studies (I2 99%). The pooled specificity estimate for AUDIT-C (45 data points) was good: 0.84 (95%CI = 0.81–0.87) and there was high heterogeneity between studies (I2 99%). The pooled PPV estimate for AUDIT-C (22 data points) was low: 0.50 (95%CI = 0.39–0.60) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for AUDIT-C (19 data points) was good: 0.88 (95%CI = 0.83–0.92) and there was high heterogeneity between studies (I2 99%).

Brief Michigan alcoholism screening test (B-MAST)

There was insufficient data to calculate the pooled estimate for B-MAST’s alpha and kappa. The pooled sensitivity estimate for B-MAST (21 data points) was low: 0.50 (95%CI = 0.38–0.62) and there was high heterogeneity between studies (I2 99%). The pooled specificity estimate for B-MAST (21 data points) was excellent: 0.97 (95%CI = 0.96–0.98) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for B-MAST (3 data points) was moderate: 0.65 (95%CI = 0.38–0.93) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for B-MAST (2 data points) was excellent: 0.90 (95%CI = 0.87–0.94) and there was moderate heterogeneity between studies (I2 33%).

Cut down, annoyed, guilty, eye-opener (CAGE)

The pooled alpha estimate for CAGE (22 data points) was unsatisfactory: 0.70 (95%CI = 0.65–0.75) and there was high heterogeneity between studies (I2 98%). The pooled kappa estimate for CAGE (3 data points) was unsatisfactory: 0.57 (95%CI = 0.34–0.81) and there was high heterogeneity between studies (I2 97%). The pooled sensitivity estimate for CAGE (139 data points) was moderate: 0.70 (95%CI = 0.66–0.74) and there was high heterogeneity between studies (I2 98%). The pooled specificity estimate for CAGE (139 data points) was good: 0.90 (95%CI = 0.88–0.91) and there was high heterogeneity between studies (I2 99%). The pooled PPV estimate for CAGE (61 data points) was low: 0.51 (95%CI = 0.45–0.58) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for CAGE (39 data points) was excellent: 0.91 (95%CI = 0.88–0.93) and there was high heterogeneity between studies (I2 97%).

Composite international diagnostic interview (CIDI), original version, version 2.1 and version 3

Alpha coefficients are not calculated for CIDI. The pooled kappa estimate for the original version of CIDI (2 data points) was moderate: 0.82 (95%CI = 0.61–1.02) and there was high heterogeneity between studies (I2 78%). There was insufficient data to calculate the pooled estimate for sensitivity, specificity, PPV, and NPV for the original CIDI.

The pooled sensitivity estimate for CIDI version 2.1 (3 data points) was moderate: 0.75 (95%CI = 0.69–0.81) and there was low heterogeneity between studies (I2 0.0%). The pooled specificity estimate for CIDI version 2.1 (3 data points) was good: 0.84 (95%CI = 0.69–1.00) and there was high heterogeneity between studies (I2 98.7%). There was insufficient data to calculate the pooled estimate for kappa, PPV, and NPV for CIDI version 2.1.

The pooled sensitivity estimate for CIDI version 3 (4 data points) was excellent: 0.91 (95%CI = 0.82–1.00) and there was moderate heterogeneity between studies (I2 48.1%). The pooled specificity estimate for CIDI version 3 (4 data points) was excellent: 0.99 (95%CI = 0.98–1.00) and there was low heterogeneity between studies (I2 0.0%). The pooled PPV estimate for CIDI version 3 (4 data points) was excellent: 0.91 (95%CI = 0.87–0.96) and there was low heterogeneity between studies (I2 0.0%). The pooled NPV estimate for CIDI version 3 (4 data points) was excellent: 0.99 (95%CI = 0.98–1.00) and there was low heterogeneity between studies (I2 0.0%). There was insufficient data to calculate the pooled estimate for kappa for CIDI version 3.

Car, relax, alone, forget, friends, trouble (CRAFFT)

The pooled alpha estimate for CRAFFT (6 data points) was unsatisfactory: 0.69 (95%CI = 0.64–0.74) and there was high heterogeneity between studies (I2 83%). There was insufficient data to calculate the pooled estimate for kappa for CRAFFT. The pooled sensitivity estimate for CRAFFT (10 data points) was good: 0.90 (95%CI = 0.84–0.94) and there was high heterogeneity between studies (I2 97%). The pooled specificity estimate for CRAFFT (10 data points) was moderate: 0.76 (95%CI = 0.68–0.83) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for CRAFFT (8 data points) was low: 0.57 (95%CI = 0.34–0.80) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for CRAFFT (8 data points) was good: 0.86 (95%CI = 0.45–1.00) and there was high heterogeneity between studies (I2 99%).

Drug Abuse screen test (DAST)

The pooled alpha estimate for DAST (6 data points) was excellent: 0.94 (95%CI = 0.93–0.95) and there was low heterogeneity between studies (I2 0%). The pooled kappa estimate for DAST (2 data points) was moderate: 0.83 (95%CI = 0.58–1.00) and there was high heterogeneity between studies (I2 98%). The pooled sensitivity estimate for DAST (7 data points) was good: 0.85 (95%CI = 0.74–0.92) and there was high heterogeneity between studies (I2 89%). The pooled specificity estimate for DAST (7 data points) was good: 0.84 (95%CI = 0.68–0.93) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for DAST (5 data points) was low: 0.51 (95%CI = 0.32–0.70) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for DAST (4 data points) was excellent: 0.95 (95%CI = 0.89–1.00) and there was high heterogeneity between studies (I2 81%).

Drug Abuse screen test - 10-item version (DAST-10)

The pooled alpha estimate for DAST-10 (6 data points) was fair: 0.79 (95%CI = 0.68–0.89) and there was high heterogeneity between studies (I2 98%). There was insufficient data to calculate the pooled estimate for kappa for DAST-10. The pooled sensitivity estimate for DAST-10 (6 data points) was excellent: 0.90 (95%CI = 0.75–0.97) and there was high heterogeneity between studies (I2 95%). The pooled specificity estimate for DAST-10 (6 data points) was good: 0.82 (95%CI = 0.72–0.89) and there was high heterogeneity between studies (I2 92%). The pooled PPV estimate for DAST-10 (4 data points) was good: 0.80 (95%CI = 0.70–0.91) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for DAST-10 (4 data points) was good: 0.86 (95%CI = 0.81–0.91) and there was moderate heterogeneity between studies (I2 40%).

Drug use disorders identification test (DUDIT)

The pooled alpha estimate for DUDIT (15 data points) was excellent: 0.92 (95%CI = 0.90–0.95) and there was high heterogeneity between studies (I2 96%). There was insufficient data to calculate the pooled kappa estimate for DUDIT. The pooled sensitivity estimate for DUDIT (12 data points) was excellent: 0.93 (95%CI = 0.89–0.96) and there was high heterogeneity between studies (I2 76%). The pooled specificity estimate for DUDIT (12 data points) was moderate: 0.79 (95%CI = 0.67–0.87) and there was high heterogeneity between studies (I2 96%). The pooled PPV estimate (5 data points) was moderate: 0.61 (95%CI = 0.34–0.87) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate (5 data points) was excellent: 0.92 (95%CI = 0.82–1.00) and there was high heterogeneity between studies (I2 78%).

Michigan alcohol screening test (MAST)

The pooled alpha estimate for MAST (8 data points) was moderate: 0.82 (95%CI = 0.78–0.86) and there was high heterogeneity between studies (I2 83%). The pooled kappa estimate for MAST (4 data points) was unsatisfactory: 0.69 (95%CI = 0.58–0.81) and there was high heterogeneity between studies (I2 88%). The pooled sensitivity estimate for MAST (12 data points) was moderate: 0.70 (95%CI = 0.58–0.80) and there was high heterogeneity between studies (I2 95%). The pooled specificity estimate for MAST (12 data points) was good: 0.85 (95%CI = 0.77–0.91) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for MAST (9 data points) was low: 0.51 (95%CI = 0.30–0.71) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for MAST (6 data points) was good: 0.88 (95%CI = 0.82–0.94) and there was high heterogeneity between studies (I2 92%).

Problem oriented screening instrument for teenagers (POSIT)

The pooled alpha estimate for POSIT (2 data points) was good: 0.86 (95%CI = 0.73–0.98) and there was high heterogeneity between studies (I2 94%). The pooled sensitivity estimate for POSIT (3 data points) was good: 0.84 (95%CI = 0.72–0.96) and there was high heterogeneity between studies (I2 90%). The pooled specificity estimate for POSIT (3 data points) was good: 0.82 (95%CI = 0.75–0.90) and there was high heterogeneity between studies (I2 88%). There was insufficient data to calculate the pooled kappa, PPV, and NPV estimates for POSIT.

Self-administered alcoholism screening test (SAAST)

The pooled alpha estimate for SAAST (2 data points) was good: 0.89 (95%CI = 0.79–0.99) and there was high heterogeneity between studies (I2 95%). The pooled sensitivity estimate for SAAST (7 data points) was low: 0.52 (95%CI = 0.33–0.71) and there was high heterogeneity between studies (I2 98%). The pooled specificity estimate (7 data points) was good: 0.83 (95%CI = 0.76–0.90) and there was high heterogeneity between studies (I2 98%). The pooled PPV estimate for SAAST (6 data points) was low: 0.32 (95%CI = 0.22–0.42) and there was high heterogeneity between studies (I2 95%). The pooled NPV estimate for SAAST (6 data points) was excellent: 0.92 (95%CI = 0.89–0.95) and there was high heterogeneity between studies (I2 92%). There was insufficient data to calculate the pooled kappa estimates for SAAST.

Semi-structured assessment for drug dependence and alcoholism (SSADDA)

There are no alpha coefficients associated with semi-structured assessments such as SSADDA. The pooled kappa estimate for SSADDA (8 data points) was moderate: 0.84 (95%CI = 0.77–0.91) and there was high heterogeneity between studies (I2 97%). There was insufficient data to calculate the pooled sensitivity, specificity, PPV and NPV estimates for SSADDA.

Severity of dependence scale (SDS)

The pooled alpha estimate for SDS (6 data points) was good: 0.86 (95%CI = 0.78–0.93) and there was high heterogeneity between studies (I2 95%). The pooled sensitivity estimate for SDS (6 data points) was good: 0.83 (95%CI = 0.76–0.90) and there was high heterogeneity between studies (I2 77%). The pooled specificity estimate (6 data points) was good: 0.84 (95%CI = 0.78–0.89) and there was moderate heterogeneity between studies (I2 44%). The pooled PPV estimate for SDS (3 data points) was good: 0.90 (95%CI = 0.86–0.94) and there was low heterogeneity between studies (I2 0%). The pooled NPV estimate for SDS (3 data points) was good: 0.83 (95%CI = 0.76–0.89) and there was low heterogeneity between studies (I2 3.5%). There was insufficient data to calculate the pooled kappa estimate for SDS.

Tolerance-annoyance cut down eye opener (T-ACE)

The pooled alpha estimate for T-ACE (2 data points) was unsatisfactory: 0.50 (95%CI = 0.47–0.52) and there was high heterogeneity between studies (I2 29%). The pooled sensitivity estimate for T-ACE (8 data points) was good: 0.83 (95%CI = 0.74–0.92) and there was high heterogeneity between studies (I2 96%). The pooled specificity estimate for T-ACE (8 data points) was moderate: 0.72 (95%CI = 0.65–0.79) and there was high heterogeneity between studies (I2 98%). The pooled PPV estimate for T-ACE (6 data points) was low: 0.35 (95%CI = 0.25–0.45) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for T-ACE (2 data points) was good: 0.87 (95%CI = 0.62–1.00) and there was high heterogeneity between studies (I2 97%). There was insufficient data to calculate the pooled estimate for kappa for T-ACE.

Timeline Followback (TLFB)

There are no alpha coefficients associated with TLFB. The pooled kappa estimate for TLFB (3 data points) was good: 0.86 (95%CI = 0.81–0.91) and there was high heterogeneity between studies (I2 88%). The pooled sensitivity estimate for TLFB (4 data points) was moderate: 0.80 (95%CI = 0.73–0.87) and there was moderate heterogeneity between studies (I2 63%). The pooled specificity estimate for TLFB (3 data points) was excellent: 0.97 (95%CI = 0.95–0.99) and there was low heterogeneity between studies (I2 0%). There was insufficient data to calculate the pooled estimate for PPV and NPV for TLFB.

Tolerance, worried, eye-opener, amnesia, cut down (TWEAK)

The pooled alpha estimate for TWEAK (3 data points) was unsatisfactory: 0.62 (95%CI = 0.55–0.69) and there was high heterogeneity between studies (I2 86%). The pooled sensitivity estimate for TWEAK (36 data points) was good: 0.85 (95%CI = 0.80–0.89) and there was high heterogeneity between studies (I2 96%). The pooled specificity estimate for TWEAK (36 data points) was good: 0.86 (95%CI = 0.82–0.90) and there was high heterogeneity between studies (I2 99%). The pooled PPV estimate for TWEAK (5 data points) was low: 0.43 (95%CI = 0.26–0.61) and there was high heterogeneity between studies (I2 99%). The pooled NPV estimate for TWEAK (2 data points) was good: 0.88 (95%CI = 0.70–1.00) and there was high heterogeneity between studies (I2 95%). There was insufficient data to calculate the pooled estimate for kappa for TWEAK.

The chemical use, Abuse, and dependence (CUAD)

The pooled alpha estimate for CUAD (3 data points) was excellent: 0.96 (95%CI = 0.94–0.98) and there was high heterogeneity between studies (I2 95%). There was insufficient data to calculate the pooled estimate for kappa, sensitivity, specificity, PPV, and NPV for CUAD.

Biomarkers

Alanine transaminase (ALT)

The pooled sensitivity estimate for ALT (32 data points) was low: 0.32 (95%CI = 0.24–0.40) and there was high heterogeneity between studies (I2 96.1%). The pooled specificity estimate for ALT (32 data points) was good: 0.88 (95%CI = 0.83–0.92) and there was high heterogeneity between studies (I2 95.8%). The pooled PPV estimate for ALT (7 data points) was low: 0.37 (95%CI = 0.18–0.56) and there was high heterogeneity between studies (I2 96.1%). The pooled NPV estimate for ALT (4 data points) was moderate: 0.63 (95%CI = 0.42–0.85) and there was high heterogeneity between studies (I2 97.5%).

Aspartate transaminase (AST)

The pooled sensitivity estimate for AST (33 data points) was low: 0.48 (95%CI = 0.40–0.55) and there was high heterogeneity between studies (I2 97%). The pooled specificity estimate for AST (33 data points) was good: 0.86 (95%CI = 0.81–0.90) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for AST (8 data points) was low: 0.42 (95%CI = 0.27–0.57) and there was high heterogeneity between studies (I2 93%). The pooled NPV estimate for AST (6 data points) was moderate: 0.69 (95%CI = 0.55–0.83) and there was high heterogeneity between studies (I2 95%).

Aspartate transaminase, alanine transaminase ratio (AST/ALT ratio)

The pooled sensitivity estimate for AST/ALT ratio (6 data points) was low: 0.34 (95%CI = 0.22–0.46) and there was high heterogeneity between studies (I2 96%). The pooled specificity estimate (4 data points) was moderate: 0.73 (95%CI = 0.52–0.94) and there was high heterogeneity between studies (I2 98%). There was insufficient data to calculate the pooled estimate for PPV and NPV.

Blood alcohol concentration (BAC)

The pooled sensitivity estimate for BAC (5 data points) was moderate: 0.64 (95%CI = 0.59–0.69) and there was moderate heterogeneity between studies (I2 44%). The pooled specificity estimate for BAC (5 data points) was moderate: 0.80 (95%CI = 0.72–0.87) and there was high heterogeneity between studies (I2 93%). The pooled PPV estimate for BAC (3 data points) was low: 0.60 (95%CI = 0.15–1.00) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for BAC (3 data points) was moderate: 0.69 (95%CI = 0.52–0.86) and there was high heterogeneity between studies (I2 93%).

Carbohydrate deficient transferrin (CDT)

There are no alpha and kappa coefficients associated with biomarkers such as CDT. The pooled sensitivity estimate for CDT (8 data points) was low: 0.59 (95%CI = 0.43–0.73) and there was high heterogeneity between studies (I2 97%). The pooled specificity estimate for CDT (8 data points) was excellent: 0.96 (95%CI = 0.93–0.98) and there was moderate heterogeneity between studies (I2 72%). The pooled PPV estimate for CDT (6 data points) was good: 0.85 (95%CI = 0.74–0.97) and there was high heterogeneity between studies (I2 76%). The pooled NPV estimate for CDT (6 data points) was moderate: 0.79 (95%CI = 0.73–0.85) and there was high heterogeneity between studies (I2 96%).

Carbohydrate deficient transferrin-tech (CDTech)

There are no alpha and kappa coefficients associated with biomarkers such as CDTech. The pooled sensitivity estimate for CDTech (41 data points) was low: 0.54 (95%CI = 0.45–0.62) and there was high heterogeneity between studies (I2 99%). The pooled specificity estimate for CDTech (41 data points) was good: 0.89 (95%CI = 0.88–0.91) and there was high heterogeneity between studies (I2 88%). The pooled PPV estimate for CDTech (12 data points) was low: 0.52 (95%CI = 0.37–0.67) and there was high heterogeneity between studies (I2 95%). The pooled NPV estimate for CDTech (8 data points) was moderate: 0.80 (95%CI = 0.61–0.98) and there was high heterogeneity between studies (I2 99%).

Carbohydrate deficient transferrin with mean corpuscular volume (CDT with MCV)

There are no alpha and kappa coefficients associated with biomarkers such as CDT and MCV. The pooled sensitivity estimate for CDT with MCV (8 data points) was moderate: 0.74 (95%CI = 0.60–0.88) and there was high heterogeneity between studies (I2 98%). The pooled specificity estimate for CDT with MCV (4 data points) was excellent: 0.93 (95%CI = 0.91–0.95) and there was low heterogeneity between studies (I2 0%). The pooled PPV estimate for CDT with MCV (4 data points) was moderate: 0.74 (95%CI = 0.51–0.97) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for CDT with MCV (4 data points) was excellent: 0.92 (95%CI = 0.83–1.00) and there was high heterogeneity between studies (I2 95%).

Gamma-Glutamyl Transferase (GGT)

There are no alpha and kappa coefficients associated with biomarkers such as GGT. The pooled sensitivity estimate for GGT (76 data points) was low: 0.57 (95%CI = 0.50–0.64) and there was high heterogeneity between studies (I2 99%). The pooled specificity estimate for GGT (76 data points) was good: 0.83 (95%CI = 0.78–0.86) and there was high heterogeneity between studies (I2 98%). The pooled PPV estimate for GGT (30 data points) was low: 0.43 (95%CI = 0.35–0.51) and there was high heterogeneity between studies (I2 97%). The pooled NPV estimate for GGT (23 data points) was good: 0.82 (95%CI = 0.70–0.94) and there was high heterogeneity between studies (I2 99%).

Gamma-Glutamyl Transferase with mean corpuscular volume (GGT with MCV)

There are no alpha and kappa coefficients associated with biomarkers such as GGT and MCV. The pooled sensitivity estimate for GGT with MCV (10 data points) was moderate: 0.64 (95%CI = 0.38–0.84) and there was high heterogeneity between studies (I2 99%). The pooled specificity estimate for GGT with MCV (10 data points) was good: 0.87 (95%CI = 0.76–0.93) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for GGT with MCV (6 data points) was low: 0.47 (95%CI = 0.28–0.66) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for GGT with MCV (6 data points) was good: 0.88 (95%CI = 0.81–0.95) and there was high heterogeneity between studies (I2 94%).

Ethyl glucuronide (EtG)

There are no alpha and kappa coefficients associated with biomarkers such as EtG. The pooled sensitivity estimate for EtG (6 data points) was good: 0.83 (95%CI = 0.61–0.94) and there was high heterogeneity between studies (I2 91%). The pooled specificity estimate for EtG (6 data points) was excellent: 0.95 (95%CI = 0.90–0.98) and there was high heterogeneity between studies (I2 66%). The pooled PPV estimate for EtG (2 data points) was moderate: 0.61 (95%CI = 0.39–0.84) and there was moderate heterogeneity between studies (I2 58%). The pooled NPV estimate for EtG (2 data points) was good: 0.86 (95%CI = 0.78–0.94) and there was moderate heterogeneity between studies (I2 60%).

Mean corpuscular volume (MCV)

There are no alpha and kappa coefficients associated with biomarkers such as MCV. The pooled sensitivity estimate for MCV (55 data points) was low: 0.39 (95%CI = 0.33–0.45) and there was high heterogeneity between studies (I2 97%). The pooled specificity estimate for MCV (55 data points) was excellent: 0.91 (95%CI = 0.88–0.93) and there was high heterogeneity between studies (I2 98%). The pooled PPV estimate for MCV (28 data points) was low: 0.48 (95%CI = 0.36–0.59) and there was high heterogeneity between studies (I2 98%). The pooled NPV estimate for MCV (22 data points) was moderate: 0.79 (95%CI = 0.73–0.86) and there was high heterogeneity between studies (I2 99%).

Percent carbohydrate deficient transferrin (%CDT)

The pooled sensitivity estimate for %CDT (40 data points) was low: 0.56 (95%CI = 0.47–0.65) and there was high heterogeneity between studies (I2 98.2%). The pooled specificity estimate for %CDT (40 data points) was excellent: 0.91 (95%CI = 0.88–0.94) and there was high heterogeneity between studies (I2 97%). The pooled PPV estimate for %CDT (13 data points) was low: 0.58 (95%CI = 0.38–0.78) and there was high heterogeneity between studies (I2 98.5%). The pooled NPV estimate for %CDT (13 data points) was good: 0.85 (95%CI = 0.78–0.92) and there was high heterogeneity between studies (I2 97.6%).

Phosphatidylethanol (PEth)

There are no alpha and kappa coefficients associated with biomarkers such as PEth. The pooled sensitivity estimate for PEth (7 data points) was good: 0.87 (95%CI = 0.79–0.96) and there was high heterogeneity between studies (I2 94%). The pooled specificity estimate for PEth (4 data points) was excellent: 0.94 (95%CI = 0.91–0.97) and there was moderate heterogeneity between studies (I2 31%). There was insufficient data to calculate the pooled estimate for PPV and NPV for PEth.

Discussion

In this systematic review and meta-analysis, we identified 387 unique papers that have published data on the validity, reliability and diagnostic accuracy of 37 scales for substance classes that are associated with HIV risk. Based on the meta-analyzable data available, we observed that fourteen of the thirty-seven measures/scales (38%) had pooled estimates that all consistently met criteria for acceptability (i.e., ranging from fair/moderate to excellent). These included the following self-reported measures:

  • Alcohol Dependence Scale

  • Addiction Severity Index (ASI)

  • ASI subscale for Alcohol

  • ASSIST

  • Composite International Diagnostic Interview (version original, version 2.1, and version 3)

  • Drug Abuse Screen Test - 10 item scale

  • Drug Use Disorders Identification Test

  • Problem Oriented Screening Instrument for Teenagers

  • Severity of Dependence scale

  • Timeline Followback

  • Chemical Use, Abuse, and Dependence

Biomarkers that had all pooled estimates that were fair/moderate or better include the following:

  • Ethyl glucuronide

  • Phosphatidylethanol test

  • The combined use of carbohydrate deficient transferrin and mean corpuscular volume

Taken together, our findings highlight the availability of a promising range of tools for researchers and practitioners when assessing substance use, particularly those working with classes of substances associated with HIV risk, such as heroin, methamphetamine, cocaine, ecstasy, and alcohol. Nevertheless, further research is needed to determine why some substance use measures do not consistently have acceptable psychometric properties across different studies.

Overall, while most of the self-reported scales had acceptable validity, most did not have acceptable reliability: 65% of pooled estimates for alpha were in the range of fair-to-excellent, though only 44% of estimates for kappa were in that range. Moreover, a greater proportion of the scales we identified and meta-analyzed performed well at correctly identifying individuals who were not using substances or were not problematic users, both among those truly without these conditions (specificity: 97% of summary estimates) and among those classified by the scale as not having the condition (negative predictive value: 96%). In contrast, fewer scales had pooled estimates for sensitivity and positive predictive value in the fair-to-excellent range (69% and 37%, respectively). These differences may have implications for the application of these measures in different settings. For example, in the criminal justice system, it may be better to use measures with high specificity and negative predictive value if the priority is to avoid false-positive results. However, in health settings, it may be preferable to use measures with better sensitivity and positive predictive value to capture individuals who may require further assessment for substance use disorders and treatment referrals, as appropriate.
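
To make the four diagnostic accuracy measures discussed above concrete, the following is a minimal sketch of how they are computed from a 2x2 table of screening results against a reference standard, and of how predictive values follow from sensitivity, specificity, and prevalence via Bayes' theorem. The function names and the counts are hypothetical illustrations, not data from this review, and the sketch is not intended to explain the pooled estimates reported here.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, and NPV from 2x2 counts.

    tp/fn: reference-positive individuals screened positive/negative.
    fp/tn: reference-negative individuals screened positive/negative.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }

def predictive_values(sensitivity, specificity, prevalence):
    """PPV and NPV implied by sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)

# Hypothetical sample: 100 reference-positive and 200 reference-negative people.
metrics = diagnostic_accuracy(tp=90, fp=30, fn=10, tn=170)
print(metrics)

# The same sensitivity (0.90) and specificity (0.85) at a hypothetical 5% prevalence.
ppv_low, npv_low = predictive_values(0.90, 0.85, 0.05)
print(round(ppv_low, 2), round(npv_low, 2))
```

In this hypothetical example the PPV is 0.75 when one third of the sample has the condition but falls to about 0.24 at 5% prevalence, even though sensitivity and specificity are unchanged. This is one reason a measure can show good sensitivity and specificity yet a low positive predictive value in a low-prevalence sample.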

Overall, the studies identified in this review had administered scales in English, were conducted within the United States, and were less commonly tested among exclusively-women samples (there were twice as many exclusively-men samples in comparison). These findings highlight the general lack of diversity in terms of language, setting, and study population for the studies reporting validity, reliability, and diagnostic accuracy on substance use measures. Given the high morbidity and mortality associated with substance use globally and for different risk populations, greater effort is needed to further evaluate the psychometric properties of substance use measures in such samples. This study also found that few papers on substance use psychometric properties are “low risk” across all QUADAS-2 domains (16%). This finding highlights the need to further study the validity, reliability, and diagnostic accuracy of substance use measures using studies designed with better methodological rigor to reduce risk of bias.

The present study has several limitations. First, our inclusion criteria may have excluded some potentially relevant studies on the psychometric properties of substance use measures that were not published in English. Hence, although we included measures that were not administered in English as long as they were published in English, our findings may not necessarily be generalizable to the psychometric properties of non-English measures that were not published in English. It should also be noted that our eligibility criteria likely favored the inclusion of studies that were conducted in settings where English proficiency was higher, which is correlated with higher gross national income per capita [43]. Moreover, while our search strategy was developed to identify all relevant studies, many publications that calculated our psychometric properties of interest may not reference the specific key words/terms in our strategy in their titles and/or abstracts. In particular, this may occur because the psychometric data of scales may not be considered a “primary outcome” of a study, and thus may not be highlighted in the title or abstract (i.e., the relevant data are embedded within the full text only). Additionally, while we did not specifically seek out studies only among HIV-risk populations, our study did focus on substance classes that have been associated with HIV risk, namely alcohol, stimulants (methamphetamine, amphetamine, cocaine, ecstasy), and heroin. Hence, our search may have missed studies on more general substance use measures that did not explicitly name our targeted substance classes.

Furthermore, we were unable to calculate pooled estimates for some psychometric outcomes of several measures due to a lack of published data or insufficient data, including for some widely used assessments previously shown to be valid and reliable, such as the DSM-IV diagnostic modules used in the US National Surveys of Drug Use and Health, the Diagnostic Interview Schedule, and the AUDADIS [44,45,46]. Another limitation of our meta-analysis is related to our narrow definition of validity, which focused on internal validity as measured by Cronbach’s alpha values. We acknowledge that there is a range of other characteristics that examine validity that we did not include in our analysis, such as criterion validity, predictive validity, and other psychometric properties [32]. Further research is needed to fill these gaps in knowledge on the psychometric properties of these substance use measures to enable pooled summary estimate calculations. In addition, we recognize the limitation of pooling alpha and kappa statistics from clinical and epidemiologic/community samples, given that these statistical measures are margin-sensitive. Moreover, with respect to the synthesis of data on sensitivity and specificity, we acknowledge that some studies may have used imperfect gold standards, which may lead to distorted values for the individual estimates of sensitivity and specificity. Therefore, it may be appropriate to refer to the results as co-positivity and co-negativity, as suggested by Buck and Gart [47]. Finally, we also recognize that disease spectrum severity and prevalence can affect test performance for sensitivity and specificity [48, 49]. Our results should be interpreted with these limitations in mind.
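
The margin sensitivity of kappa noted above can be illustrated with a short sketch. It computes Cohen's kappa for two hypothetical 2x2 agreement tables that share the same observed agreement (90%) but differ in how common a positive classification is; the function name and the counts are assumptions made for illustration and are not drawn from the included studies.

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa for a 2x2 agreement table.

    a: both assessments positive      b: first positive, second negative
    c: first negative, second positive d: both assessments negative
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Agreement expected by chance, given each assessment's marginal totals.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    return (observed - expected) / (1 - expected)

# Balanced margins, as might arise in a clinical sample (hypothetical counts).
balanced = cohens_kappa(45, 5, 5, 45)

# Skewed margins, as might arise in a community sample (hypothetical counts).
skewed = cohens_kappa(5, 5, 5, 85)

print(round(balanced, 2), round(skewed, 2))  # 0.8 0.44
```

Both tables have identical raw agreement, yet kappa drops from 0.80 to about 0.44 when positive classifications are rare. Pooling kappa values across samples with very different prevalence may therefore combine estimates that are not directly comparable.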

To our knowledge, this is the first systematic review and meta-analysis involving the synthesis of psychometric data across different measures of substances that are associated with HIV risk. As mentioned, limited research has been conducted with respect to quantitatively pooling the psychometric characteristics of substance use measures. Our findings highlight the general strengths of many substance use measures with respect to their validity, reliability, and diagnostic accuracy across multiple studies/samples. To facilitate the dissemination of these findings, and to provide researchers with a resource to identify validated, reliable, and accurate measures for substance use, we collaborated with members of the HIV Prevention Trials Network (HPTN) Substance Use Scientific Committee to develop a web-based tool that presents the pooled summary estimates from this study. The tool, named the “Substance Use Measure Identification (SUMI) Tool”, is available as a free resource on the HPTN website (URL: https://www.hptn.org/researchtools).

Conclusion

In summary, researchers in the field of substance use should endeavor to conduct more validity, reliability, and diagnostic accuracy studies on measures to identify substance use and use disorders among more diverse settings and populations, and with more rigorous study designs. Ultimately, accurate identification of substance users and problematic substance use is a critical step in identifying individuals for substance use treatment and evaluating the effectiveness of treatment strategies. Hence, further evaluation of substance use measures is of great importance not only to the field of substance use research, but also to substance use treatment. Given the substantial contribution of substance use to the global burden of disease [5], having robust data on the psychometric properties of substance use measures can help researchers identify the best tools to use in research studies, further enhancing the collection of more valid, reliable, and accurate data to inform evidence-based responses to substance use.