The accurate assessment of health risk behaviors is essential for those wanting to describe and predict trends, identify populations at risk, and evaluate the effectiveness of interventions, as well as to advocate for support and to develop policies and programs (Brener et al. 2003; Falck et al. 1992; Kalichman et al. 1997; Kauth et al. 1991; Weinhardt et al. 1998). The assessment of Human Immunodeficiency Virus (HIV) risk behaviors is complex due to the inherently private, often stigmatized, and sometimes illegal nature of these drug-use and sexual risk behaviors. To assess such risk behaviors, researchers often rely on individuals’ self-reports for both practical and ethical reasons.

A variety of approaches have been utilized to assess the reliability of people’s self-reports of HIV risk behaviors. Research has demonstrated that data collected from high-risk populations, such as drug users, are, on the whole, reliable (Darke 1998; Dowling-Guyer et al. 1994; Goldstein et al. 1995; Johnson et al. 2000). However, there are a variety of factors likely to affect the reliability of data collected in this manner, such as individuals’ motivation and ability to respond accurately. Fear of legal reprisal or self-presentation biases may lead participants to hide behaviors that they perceive to be undesirable or stigmatized (Catania et al. 1990a; Hser et al. 1992; Latkin et al. 1993; Weinhardt et al. 1998).

Reliability of measures of HIV risk behavior may also be affected by factors associated with the measure itself (Blair and Burton 1987). For example, the length of time for which participants are asked to recall risk behaviors, or the recall period, is likely to affect the reliability of a measure (Blair and Burton 1987). It seems reasonable to expect people to find it easier to remember behaviors over recent short periods of time compared to longer periods of time. Reports covering short, easier-to-remember periods should therefore be more reliable than reports covering longer, more difficult-to-recall periods.

The length of the recall period used for self-report instruments has implications for the strategy used to recall behavioral frequency (Conrad et al. 1998; Jaccard and Wan 1995; McFarlane and Lawrence 1999). This, in turn, affects the reliability of self-report data. Shorter recall periods are more likely to lead people to use episodic recall strategies, such as enumeration (Bogart et al. 2007), that are thought to be more reliable than other strategies (Conrad et al. 1998; Jaccard et al. 2002). Enumeration involves scanning a recall period for a particular behavior and counting all recalled instances of that behavior that occurred within the recall period (Jaccard and Wan 1995). Episodic enumeration may be common when behaviors are infrequent, irregular, or distinctive (Blair and Burton 1987; Conrad et al. 1998). However, if a behavior occurs frequently and episodes are indistinct, enumeration may become increasingly difficult, time consuming, and less likely to occur (Blair and Burton 1987).

Instead of enumerating, people may use rate-based inferences (Conrad et al. 1998): they recall how often an event occurs during a representative period (e.g., once a week) and multiply that rate by the length of the recall period (e.g., 12 times in a 3-month period). If the number of events or the rate of events is not retrievable, other strategies such as qualitative impressions, memory assessments, or normative expectations may be used (Conrad et al. 1998; Jaccard and Wan 1995). The use of such mental calculations and impressions can be imprecise and inconsistent (Bogart et al. 2007; Downey et al. 1995; Jaccard et al. 2002). As the recall period increases, so too may the use of these recall strategies, which risks reducing the reliability of self-report data (Conrad et al. 1998).

Several researchers have examined the relationship between recall period and the test-retest reliability of self-report data, with some providing evidence that shorter recall periods may be more reliable than longer recall periods. Kauth et al. (1991) compared sexual risk behavior self-reports for 2-week, 3-month, and 12-month periods and argued that, as the length of recall period increased, inconsistency in responding increased. However, this study did not use a true test-retest methodology, and instead extrapolated data from 2 weeks and 3 months to be equivalent to 12 months. Catania (unpublished data cited in Catania et al. 1990b) assessed the test-retest reliability of college students’ reports of frequency of vaginal intercourse using varying recall periods. The results suggested that, as the recall interval increased from 1 month to 1 year, test-retest reliability of the measure decreased. In a study examining recall of substance use, Martin et al. (1998) found that shorter recall periods (30 and 90 days) were more reliable than longer recall periods (180 and 360 days).

On the other hand, some studies have failed to find differences in reliability between different recall periods, or have found inconsistent results. Using the Timeline Followback (TLFB) method, a variety of studies found no differences in reliability as the length of recall period increased (Carey et al. 2001; Ehrman and Robbins 1994; Levy et al. 2004; Sacks et al. 2003). Klinkenberg et al. (2002) compared recall periods of 3 and 6 months and found that recall of alcohol and drug use was more reliable at 3 months and recall of number of sexual partners more reliable at 6 months. Jaccard et al. (2002) examined self-reports of condom use and sexual behaviors, and found recall periods of 3 and 6 months to be preferable to recall periods of 1 and 12 months. Jaccard et al. (2004) compared recall periods of 1, 3, 6, and 12 months, and found that, for those with multiple sex partners, recall errors in self-report of number of sex partners increased as recall period increased. However, the correlation between self-report and behavior was highest at 6 months, which Jaccard et al. attributed to restricted variability in responses for shorter recall periods. Graham et al. (2003) compared recall periods of 1, 2, and 3 months and found evidence that, for a high-frequency behavior (i.e., heterosexual vaginal sex), accuracy decreased as recall period increased. However, their findings indicated that, for infrequent behaviors, reliability did not decrease as the recall period increased.

Given that research addressing the reliability of different recall periods has produced varying results and little consensus on what length of recall period is optimal (Jaccard et al. 2002), there have been a variety of calls for further research to examine this topic (Catania et al. 1990b, 1993; Downey et al. 1995; Noar et al. 2006; Schroder et al. 2003). The lack of agreement on the appropriate recall period reduces the comparability of different studies examining the impact of HIV risk behaviors (Catania et al. 1990b) and hinders research in this area. To address this lack of research, the present meta-analysis reviews and extends previous studies by comparing the test-retest reliability of three commonly used recall periods (1, 3, and 6 months). In doing so, our aim was to inform future researchers about differences in reliability and to draw attention to the importance of comparing the reliability of different recall periods, so that researchers may develop optimal self-report instruments for assessing HIV risk behaviors.


Selection of Studies

Papers published in English that examined the test-retest reliability of measures of sex and drug use behaviors were selected. Studies were identified using electronic databases (PsycInfo, PsycArticles, PubMed) and review articles (Noar et al. 2006; Weinhardt et al. 1998). Multiple search terms were used in combination, including recall period, reference period, self-report, test-retest, sex, drug use, HIV risk, and reliability. Authors were contacted to request any relevant published or unpublished data. The reference sections of potential articles were checked for additional citations.

All studies considered for inclusion had to meet the following criteria:

  1. Studies had to include test-retest reliability of recall of HIV risk-related behaviors. Studies that examined consistency between two partners’ recall of behaviors, comparisons of two different approaches to measurement (for example, comparison of diary methods and single-item recall methods), or compared recall of behaviors for the same length of recall period, but for two time periods which did not overlap, were excluded from the analysis.

  2. Only studies that reported assessing behaviors over the prior 1, 3, and/or 6 months were considered for inclusion. Studies that did not report a specific recall period in any form were excluded.

  3. Measures assessing HIV risk-related behaviors, including sex behaviors and drug use behaviors, were included. Studies that examined the reliability of self reports of attitude, opinion, craving, or substance dependence were excluded.

  4. Only studies that reported the reliability of continuous measures of risk behaviors were considered for inclusion. Because of differences in the ways people are likely to recall frequency data (e.g., How many times do you use crack?) and categorical data (e.g., Did you ever use crack?), studies that only assessed categorical data were excluded.

  5. Studies that reported Pearson’s correlation coefficients or interclass correlations were included in the analysis. Studies that examined ordinal level data (e.g., response options of: once a month, once a week, once a day) were excluded from the analysis.

  6. Only studies for which the sample size was available were included.

In total, 28 studies yielded over 300 test-retest effect sizes. Based on the studies that reported the demographics of their samples, ages of those included in the studies ranged from 12 to 74 years old, with the majority being male. The sample included in-treatment and out-of-treatment drug users, sex workers, psychiatric patients, and adolescents. A description of the studies included in the meta-analysis can be found in Tables 1 and 2.

Table 1 Description of studies reporting test-retest reliability of drug use variables
Table 2 Description of studies reporting test-retest reliability of sex behavior variables

Aggregation of Within-Sample Effect Sizes

The majority of studies included the assessment of test-retest reliability for multiple items. To avoid including multiple statistics from the same study in the meta-analysis, leading to non-independence of the effect sizes, correlations from the same study were aggregated to provide a mean correlation (Lipsey and Wilson 2001, p. 125). Aggregated effect sizes were calculated separately for drug and sex behaviors, and were used to compute combined effect sizes examining self-report of all drug use behaviors and all sex behaviors. In addition, separate analyses were performed looking at more specific drug and sex behaviors, for example, use of different types of drugs and self-reports of different types of sex behaviors. For studies that reported more than one statistic for one of these more specific behaviors, these statistics were aggregated before being included in the analysis of the different types of behaviors. For example, if a study reported ten items assessing drug use behavior, these items were aggregated for the combined drug use analysis. If the same study reported three items assessing marijuana use, these three items were aggregated for the marijuana analysis. If the sample sizes varied for individual analyses within studies, the mean correlation was calculated by converting the correlations to Fisher’s Z, weighting the values by n−3, calculating the mean, and then transforming the mean back into a correlation coefficient.
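The within-study aggregation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' own code: the function name `aggregate_correlations` and the example inputs are hypothetical, and the only steps taken from the text are the conversion to Fisher's Z, the weighting by n − 3, and the back-transformation of the weighted mean.

```python
import math


def aggregate_correlations(rs, ns):
    """Combine several correlations from one study into a single mean r.

    rs: test-retest correlations reported by the study (each strictly
        between -1 and 1).
    ns: sample size behind each correlation (each greater than 3).
    """
    if not rs or len(rs) != len(ns):
        raise ValueError("rs and ns must be non-empty and of equal length")
    if any(n <= 3 for n in ns):
        raise ValueError("each sample size must exceed 3")

    # Fisher's Z transformation: z = atanh(r).
    zs = [math.atanh(r) for r in rs]
    # Weight each Z by n - 3, as described in the text.
    weights = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    # Transform the weighted mean Z back into a correlation coefficient.
    return math.tanh(mean_z)


# Hypothetical example: three marijuana-use items from a single study.
study_r = aggregate_correlations([0.90, 0.85, 0.80], [120, 118, 115])
```

Averaging on the Z scale rather than averaging the raw correlations avoids the downward bias that arises because the sampling distribution of r is skewed when correlations are large, which is typical of reliability coefficients.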

Correlational Analysis

Effect sizes were computed using the procedures outlined in Hedges and Olkin (1985). An effect size was calculated for each behavior by converting relevant correlations to Fisher’s Z, weighting the values by the sample size, calculating the mean, and then transforming the mean back into a correlation coefficient. Using the formulas supplied by Hedges and Olkin (1985, p. 227), 95% confidence intervals (95% CI) were calculated. Z tests were used to compare effect sizes for the three recall periods, both for the combined drug and sex variables and for more specific drug-use and sex behaviors.
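The pooling and comparison steps can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the analysis: the text does not give the formula for the Z test, so the sketch assumes the common large-sample test for two independent pooled Fisher's Z values; it also assumes inverse-variance weights of n − 3 and a critical value of 1.96 for the 95% CI. The function names and input values are hypothetical.

```python
import math


def pooled_correlation(rs, ns, z_crit=1.96):
    """Pool study-level correlations for one recall period.

    Returns the pooled r, its confidence interval, and the pooled
    Fisher's Z with its standard error (needed for comparisons).
    """
    # Assumption: weights of n - 3, the inverse variance of Fisher's Z.
    weights = [n - 3 for n in ns]
    total_weight = sum(weights)
    mean_z = sum(w * math.atanh(r) for w, r in zip(weights, rs)) / total_weight
    se = 1.0 / math.sqrt(total_weight)
    return {
        "r": math.tanh(mean_z),
        "ci": (math.tanh(mean_z - z_crit * se), math.tanh(mean_z + z_crit * se)),
        "mean_z": mean_z,
        "se": se,
    }


def compare_pooled(a, b):
    """Z test for the difference between two independent pooled estimates."""
    z_stat = (a["mean_z"] - b["mean_z"]) / math.sqrt(a["se"] ** 2 + b["se"] ** 2)
    # Two-sided p value from the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return z_stat, p_value


# Hypothetical inputs: two studies with a 30-day period, one with a 6-month period.
short_period = pooled_correlation([0.90, 0.88], [120, 80])
long_period = pooled_correlation([0.80], [150])
z_stat, p_value = compare_pooled(short_period, long_period)
```

A positive `z_stat` here indicates that the first recall period yields the higher pooled reliability. The comparison treats the two sets of studies as independent, which would not hold if the same samples contributed to both recall periods.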


Results of the meta-analysis are presented in Tables 3 (drug variables) and 4 (sex variables). For each analysis, the population reliability coefficient, number of studies included, total sample size, and 95% CI are reported. Reliability coefficients for the combined-drug variables are provided in Table 3 and labeled “All drug variables.” As indicated, these reliabilities are good when a 30-day, 3-month, or 6-month recall period is used. Across all drug variables, the test-retest reliability for a recall period of 30 days (r = .90) was found to be greater than that of 3 months (r = .84; Z = 4.30, P < .001) and 6 months (r = .83; Z = 4.93, P < .001). The reliability of the data using a 3-month recall period did not differ significantly from that of the 6-month recall period (Z = .58, ns).

Table 3 Test-retest reliability of drug use variables

Marijuana use was found to be most reliably reported when a recall period of 30 days was used (r = .92), in comparison to both 3 months (r = .85; Z = 6.49, P < .001) and 6 months (r = .85; Z = 6.65, P < .001). The 3- and 6-month recall periods were not found to differ significantly (Z = .11, ns). Self-reports of cocaine use were found to be more reliable for longer recall periods. Compared to the 30-day recall period (r = .80), both the 3-month (r = .88; Z = 3.34, P < .001) and 6-month recall periods (r = .87; Z = 3.44, P < .001) were more reliable, and did not differ significantly from one another (Z = .45, ns).

Test-retest reliability of amphetamine use was found to be higher for a recall period of 30 days (r = .93) compared to 6 months (r = .80; Z = 4.12, P < .001). The reliability of self-reports of heroin use did not differ significantly between the 30-day (r = .80) and 6-month recall periods (r = .83; Z = −1.00, ns). Self-reports of sharing works (needles/syringes/cookers/cottons) had the lowest test-retest reliability, but did not differ significantly between 30 days (r = .69) and 6 months (r = .73; Z = −1.33, ns). The literature review revealed too few studies reporting a recall period of 3 months for amphetamines, heroin, and sharing works; thus, for these three variables, analyses were limited to comparing 30-day and 6-month recall periods.

The combined sex-behavior measures are labeled “All sex variables” in Table 4. Self-report of sex behaviors across all items was found to be most reliable when a recall period of 3 months was used (r = .95), compared to both 30 days (r = .82; Z = 10.99, P < .001) and 6 months (r = .82; Z = 8.31, P < .001). There was no significant difference in reliability between the 30-day and 6-month recall periods (Z = .23, ns).

Table 4 Test-retest reliability of sex behavior variables

A similar pattern of results was found for vaginal sex and oral sex. The recall period of 3 months was most reliable for recall of vaginal sex (r = .97), when compared to 30 days (r = .84; Z = 10.01, P < .001) and 6 months (r = .62; Z = 11.47, P < .001). The 30-day recall period was more reliable than the 6-month recall period (Z = 5.14, P < .001). The recall period of 3 months was most reliable for oral sex (r = .90), when compared to 30 days (r = .77; Z = 4.41, P < .001) and 6 months (r = .61; Z = 5.34, P < .001). The 30-day recall period was more reliable than the 6-month recall period for recall of oral sex (Z = 2.22, P < .05). The 30-day recall period (r = .90) was also more reliable than the 6-month recall period for recall of anal sex (r = .58; Z = 5.87, P < .001).

For recall of number of sexual partners, the recall period of 6 months was more reliable (r = .93) than both 30 days (r = .79; Z = 9.17, P < .001) and 3 months (r = .85; Z = 4.94, P < .001). The 3-month recall period was more reliable than the 30-day recall period (Z = 2.77, P < .01).


Using meta-analysis, the present study sought to examine the test-retest reliability of commonly used recall periods. Understanding what influence, if any, the length of recall period has on the reliability of self-report data is important for designing measures. The current analysis demonstrates that the reliability of self-reports of sex and drug behaviors, for different lengths of recall periods, depends upon the particular behavior assessed.

For most drug-use behaviors, all three recall periods (30 days, 3 months, and 6 months) demonstrated acceptable reliability. Overall, the 30-day recall period produced the most reliable reports when all drug-use behavior items were combined. When more specific behaviors were examined, self-report of marijuana use was found to be most reliable for shorter recall periods (30 days). This finding is consistent with the suggestion that for more frequent behaviors, shorter recall periods may be more accurate (McFarlane and Lawrence 1999), with marijuana being the most frequently reported illicit drug used in the United States (Office of Applied Studies 2007). Self-reports of amphetamine use were also found to be more reliable for shorter recall periods. However, very few studies were identified that examined the reliability of self-report of amphetamine use. Future studies are needed to examine whether shorter recall periods provide more reliable alternatives for self-reports of amphetamine use.

Although past researchers have suggested that self-reports of drug use may be more reliable with shorter recall periods (Kauth et al. 1991; Martin et al. 1998), the current analysis suggests that this is not always the case for all drugs. For example, whereas length of recall period did not affect the reliability of self-reports of heroin use and sharing of drug-use equipment, cocaine/crack use was more reliably reported when longer recall periods (3 and 6 months) were used. Several reasons may explain why shorter recall periods lead to less reliable self-reports for these drugs. Attenuation due to restricted range may reduce the reliability estimates for shorter periods during which there is less variability in reports of frequency of drug use. Drug use patterns have been found to be highly variable (Samuels et al. 1992), and it may be that longer recall periods are needed to capture some of these behaviors reliably.

Changes in reliability of recall of sex behaviors and partners across the recall periods may also reflect attenuation due to restricted range. The reliability of recall of number of sexual partners was found to increase as the length of recall period increased. This may reflect an increase in variability of number of sexual partners reported as the length of recall increases. For recall of sexual activity, a recall period of 30 days may be too short for some individuals to report having engaged in this behavior. That is, short recall periods may produce little variation in self-reports of frequency of sexual behaviors compared to, for example, a 3-month recall period. On the other hand, individuals may not be able to accurately recall their sexual behaviors over a longer period of 6 months, thus causing reliability to decrease.
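The attenuation argument can be illustrated with a small simulation. This is a toy model, not an analysis of the studies reviewed here: it assumes each report equals a true frequency plus independent normally distributed error, and that the error variance is the same regardless of how much true frequencies vary between people. The function names and parameter values are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate_test_retest(sd_true, sd_error, n=5000, seed=0):
    """Simulate test and retest reports and return their correlation.

    sd_true:  spread of true behavior frequencies across people.
    sd_error: spread of reporting error at each administration.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, sd_true) for _ in range(n)]
    test = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    retest = [t + rng.gauss(0.0, sd_error) for t in true_scores]
    return pearson_r(test, retest)


# Same reporting error, different between-person variability.
r_narrow = simulate_test_retest(sd_true=1.0, sd_error=1.0)  # restricted range
r_wide = simulate_test_retest(sd_true=3.0, sd_error=1.0)    # wider range
```

Under this model the expected test-retest correlation is the true-score variance divided by the sum of the true-score and error variances, so it is about .50 in the restricted case and about .90 in the wider case even though respondents are equally precise in both. A longer recall period that spreads respondents out could therefore raise the reliability coefficient without any improvement in recall.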

Jaccard et al. (2002) predicted that, for self-reports of sexual behaviors, moderate-length recall periods (3 or 6 months) would be more reliable than shorter recall periods (1 month). These researchers argued that moderate-length recall periods may allow those who engage in sex infrequently to provide fairly reliable estimates of sexual behavior using enumeration strategies. In contrast, those who engage in frequent sex may be discouraged from using episodic strategies and instead use rule-based strategies which, for frequent behavior, may be more reliable. Therefore, moderate-length recall periods may maximize accurate recall of sexual behaviors for both those who engage in sex frequently and those who do so infrequently. This pattern of results is seen for the self-report of vaginal and oral sex, with the 30-day recall period producing lower reliability estimates than the 3-month recall period.

Past research has demonstrated that several factors are likely to influence the reliability of self-report data, including the frequency of behaviors. Recall of less frequent behavior appears to lead to the use of more reliable recall strategies, such as enumeration (Bogart et al. 2007), fewer errors in recall (McLaws et al. 1990) and more reliable recall (Downey et al. 1995). Differences in the patterns of test-retest reliabilities across the recall periods for sex and drug-use behaviors may, in part, reflect differences in frequency of behavior. One limitation of the current study is that data were not available to directly test the hypothesis regarding the interaction between length of recall period and frequency of behavior on the accuracy of recall. Nor were enough data available to allow direct tests of whether some approaches to measurement were more reliable than others, for example, whether the use of the Timeline Followback (TLFB) differed in reliability compared to other methods. Further research is needed to address how different factors interact to influence reliability of recall. Research of this nature would allow researchers to better select a reliable recall period based on, for example, characteristics of the behavior (i.e., frequency, desirability), question format, or cognitive strategy that individuals are likely to employ. Although Jaccard and Wan (1995) have begun to explore one type of research paradigm that would address some of these issues, there continues to be a lack of research in this area.

Test-retest reliability provides one approach to examining the accuracy of self-reports of HIV risk behaviors, but it is not without limitations. Although consistent self-reports across two time points can result from accurate recall of behavior, they may also reflect recall of the responses given at the first administration of questions, or some combination of the two. Other approaches, such as diary methods, biological markers, or comparisons of drug-using or sex partners’ self-reports, have been used to address the reliability and validity of sex and drug-use self-report data (Darke 1998; Jaccard et al. 2002; Jaccard and Wan 1995; Stopka et al. 2004). To augment the examination of the reliability of self-reported HIV risk behaviors, these methods could be employed to examine the influence of length of recall period on the accuracy of self-report data (Graham et al. 2003).

The current study highlights the need for more data to be collected addressing the reliability of different recall periods. Many studies failed to use or report a specific recall period. This limitation makes it difficult to accumulate data on the influence of length of recall period. Other studies reported only the reliability of combined items, making it difficult to tease apart findings and examine the reliability of items measuring different types of drug use or sexual behaviors. The findings of the present study demonstrate that reliability may differ depending upon the particular drug or sex behavior being assessed, thus making it important to be able to examine the reliability of self-reports of these behaviors separately. The current paper draws attention to the lack of research addressing the optimal length of recall period for assessing self-reports of different HIV risk-related behaviors. For example, few studies investigated the test-retest reliability of anal sex or needle sharing behaviors.

The results of the current meta-analysis support the use of 3-month recall periods for self-reports of sexual behaviors, including vaginal and oral sex. Further data are needed to examine whether a 3-month recall period may also provide a reliable approach for self-reports of anal sex. Self-reports of number of sexual partners were more reliable when longer recall periods were used, supporting previous research examining recall of sexual partners (Klinkenberg et al. 2002). Marijuana use was most reliably reported over a 30-day recall period, whereas crack/cocaine self-reports were more reliable over 3- and 6-month recall periods. The most appropriate recall period may depend on a combination of factors, including the research question and the manner in which behaviors are assessed (McFarlane and Lawrence 1999). However, understanding what influence the length of recall period may have on the reliability of recall is important for making informed decisions when designing self-report measures.