Background

Alcohol use and its associated consequences are a major public health problem, and alcohol has been described as the third leading risk factor for poor health globally [1]. Revised guidelines from the UK (United Kingdom) Chief Medical Officers recently advised adults about the likely harmful health effects of drinking more than 14 units/week [2], which is approximately six 175 ml glasses of (13%) wine, six 568 ml pints of (4%) lager or ale or (4.5%) cider, or fourteen 25 ml measures of (40%) spirits (in the UK, 1 unit is 10 ml or 8 g of pure alcohol) [3]. The Global Burden of Disease Survey identified alcohol as a top five risk factor for non-communicable disease in the UK [4]. It is important that reliable and valid measures are used to monitor and assess alcohol misuse and related problems and, in turn, to inform public health strategies.
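The unit arithmetic behind these equivalences can be checked with a short calculation. The sketch below is illustrative only: the function name `uk_units` and the dictionary of example drinks are assumptions introduced here, and the only rule it relies on is the definition quoted in the text (1 unit is 10 ml of pure alcohol).

```python
def uk_units(volume_ml: float, abv_percent: float) -> float:
    """Return UK alcohol units for one drink (1 unit = 10 ml pure alcohol)."""
    pure_alcohol_ml = volume_ml * abv_percent / 100
    return pure_alcohol_ml / 10


# Weekly totals for three of the example drinks quoted in the text.
weekly_examples = {
    "six 175 ml glasses of 13% wine": 6 * uk_units(175, 13),
    "six 568 ml pints of 4% lager": 6 * uk_units(568, 4),
    "fourteen 25 ml measures of 40% spirits": 14 * uk_units(25, 40),
}

for label, units in weekly_examples.items():
    print(f"{label}: {units:.2f} units")
```

Under this definition the wine and lager examples come to a little under 14 units and the spirits example to 14 units, which is consistent with the text describing the equivalences as approximate.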

Our initial scoping exercise indicated that data about alcohol intake tends to be collected in surveys using one or more of the following three types of self-report questionnaires: Quantity-frequency measures ask questions about ‘usual’ alcohol drinking to estimate the frequency (e.g. number of days per week) and volume of alcohol consumed (e.g. ‘how many (cans/bottles/ glasses) were consumed on a typical drinking day’ [5,6,7]). Graduated-frequency questionnaires measure the volume of consumed alcohol by grouping the number of drinks per occasion into graduated categories, beginning typically with the highest amount consumed by a respondent and decreasing in pre-set categories (e.g. ‘During the last 12-months, how often did you have 12 or more drinks of any kind of alcoholic beverage in a single day?’ ‘During the last 12 months, how often did you have at least 8 but less than 12 drinks of any kind of alcoholic beverage in a single day?’ [8, 9]). Short-term recall measures ask respondents to recall the alcohol that they consumed within a predetermined timeframe such as during the previous week or the last 24-h (e.g. the ‘Yesterday’ method) or using a diary to record all alcohol consumption over a period of time [10, 11].
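To make the difference between the first two approaches concrete, the sketch below shows one way that responses might be converted into an estimated volume. It is not taken from any of the cited instruments: the function names, the category bounds, the hypothetical respondent and the use of category midpoints to score graduated-frequency answers are all assumptions made for illustration.

```python
def quantity_frequency_weekly(days_per_week: float, drinks_per_day: float) -> float:
    """Usual weekly volume = usual frequency x usual quantity."""
    return days_per_week * drinks_per_day


def graduated_frequency_annual(responses):
    """Sum days-per-year x assumed drinks-per-day over graduated categories.

    `responses` is a list of (low, high, days_per_year) tuples, where low
    and high bound the drinks-per-day category. Scoring each category at
    its midpoint is an assumed rule, not one prescribed by the source.
    """
    total = 0.0
    for low, high, days_per_year in responses:
        midpoint = (low + high) / 2
        total += midpoint * days_per_year
    return total


# Hypothetical respondent: drinks on 3 days a week, 2 drinks each time.
qf_weekly = quantity_frequency_weekly(days_per_week=3, drinks_per_day=2)

# Hypothetical graduated-frequency answers, highest category first.
gf_annual = graduated_frequency_annual([
    (12, 12, 2),    # 12 or more drinks (open-ended, lower bound used): 2 days
    (8, 11, 10),    # at least 8 but fewer than 12 drinks: 10 days
    (1, 7, 100),    # 1 to 7 drinks: 100 days
])
```

The quantity-frequency estimate rests on a single 'usual' pattern, whereas the graduated-frequency estimate accumulates volume across categories, so heavier but less frequent drinking days contribute to the total.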

Survey instruments need to measure alcohol consumption accurately in order to identify the population of drinkers who consume more than 14 units of alcohol per week [2] or who misuse alcohol. In this review, alcohol misuse is defined as ‘drinking excessively – more than the lower-risk limits of alcohol consumption’ [12]. Gmel [13] conducted a literature review of self-report measures (the quantity-frequency, graduated-frequency and short-term recall measures) compared to biological tests (i.e. blood alcohol concentration) using studies published in this field since 2004; and Feunekes [14] conducted a systematic review of studies published 1984–1999 on the capacity of the quantity frequency, extended quantity frequency, retrospective diary, prospective diary, and 24-h recall measures, respectively, to classify individuals according to their alcohol intake. These previous reviews are outdated and not in keeping with advances in survey methodology and design concerning alcohol research or with public health guideline changes (such as the reduction in alcohol guidelines in the UK [2]). This paper presents the results of a systematic review of all relevant research evidence regarding the reliability and validity of different types of survey measures of self-reported alcohol consumption in the adult population. Reliability and validity in this review are defined by the COnsensus-based Standards for the selection of health Measurement Instruments (COSMIN) methodology [15]. COSMIN provided an iterative way of assessing the psychometric properties of included measures. The review adds to previous research by providing the first COSMIN-type review of alcohol intake measures as well as providing an updated review of the alcohol consumption measures. This review addressed the following questions:

Are self-reporting measures (the quantity-frequency, graduated-frequency and short-term recall measures) reliable and valid in their assessment of alcohol consumption for the general population? If so, which of the self-reporting measures are most reliable and valid? Which measure most accurately identifies levels of alcohol consumption? The use of a reliable and valid measure in alcohol survey research will enhance the rigour and comparability of studies.

Methods

The review was reported in accordance with PRISMA guidelines (see checklist attached as Additional file 1) [16]. No protocol exists for this review. Study authors searched PUBMED (1966-present), MEDLINE (1946-present), EMBASE (1947-present), CINAHL (1937-present), PsycINFO (1887-present) and SSCI (1976-present) from their inception to 11th August 2017 for peer-reviewed articles. Search terms were based on a COSMIN search filter to identify studies of psychometric properties, combined with terms relevant to alcohol intake measures (Fig. 1).

Fig. 1 Search strategy: list of free text terms and medical subject headings searched for using the conjunctions ‘AND’ or ‘OR’ to find articles which met the inclusion criteria using the online bibliographic databases

Eligibility criteria

Papers were included if they were English language peer-reviewed studies that evaluated the reliability or validity of survey measures of alcohol consumption that were ‘self-completed’ by adults aged ≥18 years via telephone, paper, computer or interview. Studies were included if they assessed the reliability or validity of self-report alcohol consumption measures (the quantity-frequency, graduated-frequency or short-term recall measures or any variation of these measures). Studies were excluded if they did not focus on reliability or validity, if they were reviews of the literature, or if study participants had a mental or alcohol disorder diagnosis, were in receipt of treatment for alcohol misuse or were being cared for in a care institution. The review focused upon evaluating the psychometric properties of alcohol consumption measurement for the general drinking population; previous research indicates that people with an alcohol use disorder diagnosis tend to self-report differently from other drinkers (see discussion [17]). Studies were also excluded if they measured self-reported alcohol consumption using other methods only (biological testing or self-reporting alcohol tests).

Titles were exported to Refworks, duplicates were removed and titles and then suitable abstracts were screened and examined by HMcK, CT and MD independently. Cases of disagreement over study inclusion were resolved via review and discussion. Data collection from eligible studies involved extracting information about population characteristics, measures, results and COSMIN quality ratings onto an Excel spreadsheet (see Table 2). This was completed by HMcK and checked by other reviewers. Reference lists of literature reviews and citation lists of included studies were searched for relevant papers. The search strategy identified 806 studies after duplicate removal, 478 remained following examination of abstracts and 28 papers were included following full-text review (Fig. 2).

Fig. 2 PRISMA flow diagram [16]: flowchart depicting the process of searching, selecting and sifting studies according to eligibility criteria. The search stages were identification, screening, eligibility and inclusion

Quality assessment

Pairs of independent reviewers applied the well-validated COSMIN checklist to assess the methodological quality of included studies. Definitions of the psychometric properties are provided by COSMIN (see Table 1). Information (e.g. coefficients) on the psychometric properties that included studies reported for each measure was assessed using the quality criteria COSMIN checklist created by Terwee [18], which generated ratings of good, moderate or poor. An additional methodological quality score was calculated for each psychometric property checklist using the ‘worst score counts’ method, where the lowest rating of any of the items in an individual psychometric property checklist is taken as the overall score for that property [19]. Risk of bias (where evidence reported by studies may not be trustworthy [20]) was accounted for by assessing the methodological quality of studies. It is important to note that the review reported the properties that were recorded in the original articles and that most articles did not assess or report the full range of properties recommended by COSMIN.
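The ‘worst score counts’ rule can be expressed in a few lines. The sketch below is a minimal illustration rather than the COSMIN tooling itself: the function name, the example item ratings and the assumption that items are scored on the four labels used in the Results (excellent, good, fair, poor) are introduced here for illustration.

```python
# Ordered from worst to best. The labels are assumed to match the
# four-point methodological quality ratings reported in the Results.
RATING_ORDER = ["poor", "fair", "good", "excellent"]


def worst_score_counts(item_ratings):
    """Return the lowest rating among the items of one property checklist."""
    if not item_ratings:
        raise ValueError("at least one item rating is required")
    return min(item_ratings, key=RATING_ORDER.index)


# Hypothetical item ratings for one psychometric property in one study.
test_retest_items = ["excellent", "good", "good", "poor", "excellent"]
overall = worst_score_counts(test_retest_items)
```

In this hypothetical case a single item rated poor determines the overall rating, even though the remaining items were rated good or excellent.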

Table 1 COSMIN definitions of domains, measurement properties, and aspects of measurement properties [18]

Results

Table 2 presents the characteristics and results from the 28 papers that met inclusion criteria. It summarises the content of Tables S1 and S2, which are included as Additional files 2 and 3. Included studies reported drinks/alcohol measures in standard sizes for the country of publication (see Additional file 2: Table S1). Some studies included beverage-specific measures. Studies were conducted in the USA (n = 18), Australia (n = 4), Canada (n = 2), Finland (n = 2), UK (n = 1) and the Netherlands (n = 1). Short-term recall measures were studied most often (n = 21), followed by quantity-frequency measures (n = 14) and graduated-frequency measures (n = 11). Convergent validity (n = 15), criterion validity (n = 14), test-retest reliability (n = 10), predictive validity (n = 9), inter-rater reliability (n = 5), hypothesis validity (n = 4), construct validity (n = 2), divergent validity (n = 2), and structural validity (n = 1) were assessed across the studies. Some studies assessed the psychometric properties of more than one measure and measure type, but no study assessed all COSMIN psychometric properties.

Table 2 Summary of characteristics and psychometric properties for included studies

Methodological quality assessment

There was wide variation in methodological quality ratings for each psychometric property (as presented and discussed below).

Quantity-frequency measures achieved criterion validity ratings of excellent (n = 1), fair (n = 1) and poor (n = 2). Test-retest reliability quality ratings were good (n = 1), fair (n = 1) and poor (n = 2), with inter-rater reliability rated fair (n = 1) and poor (n = 1). Convergent validity ratings were good (n = 1) and fair (n = 2). Hypothesis validity was rated good (n = 1) and fair (n = 1). Predictive validity was rated excellent (n = 1) and structural validity fair (n = 1).

The graduated-frequency measures achieved convergent validity ratings of good (n = 2) and fair (n = 3). Test-retest reliability was rated fair (n = 2) and good (n = 1), and inter-rater reliability was rated fair (n = 1). Criterion validity was rated good (n = 1), fair (n = 1) and poor (n = 1). Predictive validity was rated excellent (n = 1), good (n = 1) and fair (n = 1). Divergent validity was rated fair (n = 1). Construct validity was rated fair (n = 1).

The criterion validity ratings for the short-term recall measures were excellent (n = 1), good (n = 1), fair (n = 1) and poor (n = 4). Convergent validity was rated good (n = 2) and fair (n = 5). Predictive validity was rated excellent (n = 1), good (n = 1), fair (n = 2) and poor (n = 1). Test-retest reliability was rated fair (n = 3), with inter-rater reliability also rated fair (n = 1). Hypothesis validity was rated good (n = 1) and fair (n = 1). Divergent validity was rated fair (n = 1) and construct validity was rated poor (n = 1).

Test-retest reliability

Quantity-frequency and graduated-frequency measures completed by a Finnish population sample [11] and a computer and paper administered quantity-frequency measure demonstrated good test-retest reliabilities [6]. Moderate test-retest reliabilities were reported for a quantity-frequency measure administered to a general population sample [21] and for quantity-frequency and short-term recall measures in an Australian general sample of twins [22]. Good test-retest reliability was reported in an undergraduate student population sample for a graduated-frequency measure [10] and in a general population [23]. Test-retest reliability of a daily intake short-term recall measure was good for an older adult sample [24]. Moderate test-retest reliability was reported for a short-term recall measure of ≥5 drinks consumed per drinking occasion [25]. In an older population sample, inter-rater reliability was good for quantity-frequency and short-term recall measures [26] though poor inter-rater reliability was reported in a study administering a weekly quantity-frequency measure to over 65-year olds [7] and for the graduated-frequency and short-term recall measures in a general population [27] (for detailed results see Table 2).

Criterion validity

Studies of quantity-frequency measures administered to the general population sample [28,29,30] and a quantity-frequency and short-term recall measure [31] demonstrated good criterion validity. An annual graduated-frequency measure and previous 24 h short-term recall measure administered in a general population sample indicated good criterion validity for ‘heavy drinkers’. Poor validity was reported for moderate drinkers in this study (due perhaps to the fact that consumers of lower levels of alcohol may drink irregularly and not within the 24-h before administration of the short-term recall measure) [27]. An undergraduate student sample completed two graduated-frequency measures and a short-term recall measure with moderate criterion validity [32]. Short-term recall spousal reports that were used as a criterion or standard to validate alcohol intake in an older sample reported good criterion validity [24]. A short-term recall measure administered to an undergraduate student sample had poor criterion validity [33] though other studies of the short-term recall measure [34] and the short-term recall and graduated-frequency measures [9] reported good criterion validity (see Table 2).

Construct validity

Poor construct validity was found for a 30-day graduated-frequency measure completed by an undergraduate sample (age range 18–20 years) [35]. A short-term recall measure compared with the MAST measure on two separate occasions in a sample of older adults showed poor to moderate construct validity [24] (see Table 2).

Hypothesis validity

Good hypothesis validity was reported for a quantity-frequency measure compared to a short-term recall measure in an older adult population sample [26] and for a quantity-frequency measure compared to a short-term measure in a general population sample [36] (see Table 2).

Predictive validity

One study of a graduated-frequency and short-term recall measure that was completed by an undergraduate student sample demonstrated adequate to good predictive validity [9], whilst another study of the same measures in an undergraduate student sample (age range 18–20 years), albeit with a small sample size, recorded poor predictive validity [32]. A general population study found poor predictive validity for the three measures [37], although these were measured against unstandardized indicators of alcohol-related mortality, morbidity and harm. A short-term recall measure achieved good or adequate prediction properties regarding heavy drinking (≥5 drinks per occasion) for samples aged 18–39 [25] and for a general population [38] (see Table 2).

Convergent validity

Moderate to good convergent validity was found in a general population sample for a two-week beverage-specific quantity-frequency measure, a graduated-frequency and short-term recall measure [39]. Similarly, adequate or good convergent validity was recorded for the three types of measures of alcohol intake in a cohort of 20- to 63-year-olds [11] and in a general population [37]. A graduated-frequency and short-term recall measure demonstrated good convergent validity in undergraduate student samples [8, 10]. A short-term recall measure completed by undergraduate student samples showed adequate to good convergent validity [40]. Also, adequate convergent validity was found for short-term recall measures in a male population sample [41] (see Table 2). Only one study referred to divergent validity of the graduated-frequency and short-term recall measures and only in terms of a negative correlation in an undergraduate student sample between religiosity and alcohol consumption [10] (see Table 2). Similarly, only one study referred explicitly to structural validity: a 30-day quantity-frequency measure that was used to collect data on alcohol consumption in a general population showed poor validity [42] (see Table 2).

Overall, the review found that only a relatively small number of studies investigated the COSMIN psychometric domains of each type of measure. Furthermore, the hypothesis validity or structural validity of the graduated-frequency measure was not investigated at all nor was the structural validity of the short-term recall measure. Divergent validity or construct validity were not assessed for the quantity-frequency measure.

Discussion

Psychometric property ratings for measure types

Each type of measure appeared to have good criterion validity according to COSMIN methodology. Several different reference standards or criteria were used in the included studies to measure alcohol consumption (e.g. [9, 29]). The appropriateness of using peers [34], spousal reports [24] and short-term recall measures [31] as criterion standards is questionable and perhaps it is unsurprising that these studies reported a low quality rating (despite reporting good content validity). Currently, there is no gold standard for the measurement of alcohol consumption. Most countries use some standard unit of measurement (e.g. one drink, one unit) but there is a lack of consensus and no internationally accepted definition, thereby posing difficulties for the conduct of comparative analyses. Biological markers of alcohol consumption should be used more frequently to support and validate findings from self-reporting measures, as these methods are not subject to sampling errors or researcher or participant bias [14]. However, these measures are also not without risk of error. Alcohol abstinence in the 24 h prior to breath-, blood- or urine-ethanol measurement has been shown to produce low results even for heavy drinkers [43]. More research is needed to find a gold standard for alcohol consumption measurement.

Construct validity was poor for graduated-frequency and short-term recall measures, and not assessed for quantity-frequency measures. The structural validity of the quantity-frequency measure only was assessed and this construct validity-related property was deemed to be poor. Only one study investigated the predictive validity of the quantity-frequency measure and it found that the validity was poor. Poor predictive validity results suggest the measure may not be valid in predicting the measurement of future alcohol intake among the general population or in predicting the measurement of drinking trajectories and alcohol-related consequences. The study was conducted with good methodological quality and received a good COSMIN score.

In contrast, the graduated-frequency and short-term recall measures achieved mixed results including predicting with variable accuracy the outcomes of alcohol-related morbidity and mortality and alcohol dependence. There were several studies of the convergent validity of each measure and generally this property was deemed to be moderate to good.

Test-retest results tended to indicate that similar outcome-assessments of alcohol consumption were found when the quantity-frequency measure, graduated-frequency measure and the short-term recall measure were re-administered. Mixed results were reported for inter-rater reliability of quantity-frequency and short-term recall measures, with poor inter-rater reliability found when the graduated-frequency measure was applied. In particular, there appeared to be difficulty obtaining good agreement between raters regarding the measurement of consumed beer, wine and liquor respectively [27], and between self-report tests (AUDIT, the Alcohol Use Disorders Identification Test [44], and CAGE, Cut down, Annoyed, Guilty, Eye-opener [45]) and a quantity-frequency measure when research assistants interviewed participants using a face-to-face predetermined appointment schedule [7]. It is important to note that these studies achieved only fair or poor COSMIN ratings. Indeed, many of the reported poor psychometric properties may be due to poorly conducted studies as indicated by poor COSMIN ratings [6, 21, 31]. Variation between types of psychometric properties for the same measure (e.g. high validity for one property and low for another property) may be due to differences in study design and methodological quality.

Discrepancies between COSMIN ratings and psychometric properties

There were some studies in which there were discrepancies between COSMIN ratings of the quality of a psychometric property and the performance of a measure. For example, one study [6] reported good test-retest reliability for a typical weekly quantity-frequency measure but the methodological quality of a particular aspect of the study was rated poor because the method of administering the (computer or paper) measure of consumption was not consistent across time-points. Reasons for poor methodological quality ratings using the COSMIN checklist included inappropriate time intervals between measure administrations, ambiguity over management of missing responses, lack of assurance that patients remained stable between measure administrations, inadequate sample size and choice of inappropriate statistical methods (e.g. reporting Spearman’s correlation coefficients [46] over kappa values for test-retest reliability).
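The preference for kappa over a correlation coefficient reflects the fact that a correlation measures association, whereas kappa measures chance-corrected agreement between categorical ratings. As a minimal sketch of the latter, the code below computes unweighted Cohen's kappa for two administrations of a categorical measure. The function name, the drinking categories and the eight hypothetical respondents are assumptions made for illustration and are not drawn from any included study.

```python
from collections import Counter


def cohens_kappa(first, second):
    """Unweighted Cohen's kappa for two sets of categorical ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(first)
    # Proportion of respondents placed in the same category both times.
    observed = sum(a == b for a, b in zip(first, second)) / n
    counts_first = Counter(first)
    counts_second = Counter(second)
    # Agreement expected by chance from the marginal category frequencies.
    expected = sum(
        counts_first[c] * counts_second[c] for c in counts_first
    ) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical drinking categories for 8 respondents at two time-points.
time_1 = ["low", "low", "low", "low", "high", "high", "high", "high"]
time_2 = ["low", "low", "low", "high", "high", "high", "high", "low"]
kappa = cohens_kappa(time_1, time_2)
```

In this hypothetical example six of the eight respondents are classified identically at both time-points, but half of that agreement would be expected by chance, so kappa is lower than the raw agreement proportion.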

Issues with self-reporting alcohol consumption

Self-reported alcohol consumption is difficult to measure accurately due to the influence of social desirability and memory issues, and these factors were alluded to in many included studies (e.g. [25, 27, 32, 35]). Possible solutions to these challenges include using more anonymised interview types, randomised response techniques, checking responses using more than one alcohol measure and using memory aids (interviewer prompts, calendars or diaries) [47]. Also, population-based survey research about alcohol consumption and drinking habits is particularly problematic when the sample includes alcoholics because of uncertainty about whether or not participants are sober when interviewed, difficulty recalling consumption due to the effect of alcohol on memory and increased alcohol tolerance in frequently heavy drinkers [48]. These issues pose challenges for the reliable and valid assessment of alcohol consumption in surveys. Potential solutions include factoring in more complex survey questions requiring greater reflection on alcohol intake (if respondents are asked to consider the timing, type of beverage drunk and episodic heavy drinking, their responses should be more considered) [17], use of a breathalyser before measure administration to ensure participants are alcohol-free [49] and creating an environment that is conducive to confidentiality and honest disclosure of alcohol consumption [48, 50]. These potential solutions may be incorporated into population-based survey collection of alcohol consumption data in order to afford greater confidence in the drinking status of participants and significant assurance that responses reflect consumption accurately.

Comparison with previous reviews

Generally, the measures did not appear to vary significantly across population age and sex groupings. The assessment of the amount of alcohol consumed appeared to exert some influence on the psychometric performance of self-report measures. Parker [27] reported good concurrent validity using a short-term recall measure, though for heavy drinkers only. Gmel [13] found that the graduated-frequency measure over-reported alcohol intake, whereas the beverage-specific quantity-frequency measure provided a more accurate measure of consumption. The Feunekes review recommended that the quantity and frequency of alcohol consumption should be prioritised and assessed separately for specific types of alcoholic beverages [14], and beverage-specific quantity-frequency measures performed accurately and reliably, though only in relation to the consumption of lower levels of alcohol [26, 28]. The use of a ‘diary’ format with a predetermined timeframe (that afforded individuals an opportunity to record all alcohol consumption in a format of their choice; and usually in the format of a short-term recall measure) had good psychometric properties [24, 29]. This finding may suggest that the use of an ‘actual’ time period instead of the ‘usual’ timeframes in quantity-frequency and graduated-frequency measures [51] may add to the reliability and validity of assessments of alcohol consumption. However, both reviews found that the quantity-frequency measure performed with most reliability and validity and was the measure with the highest concordance with the short-term recall ‘diary’ measure [22, 29, 33, 38].

Recommendations for improved reliability and validity

The review findings suggest that the reliability and validity of self-reporting alcohol consumption measures may be improved in various ways. For example, computerised or automated modes of administration rather than an interviewer-based mode might facilitate greater privacy and assure more candid reporting [52]. Longer timeframes may be more desirable as they tend to capture less frequent drinkers (i.e. weekly, monthly or annual recall) and questions which involve specified timeframes (i.e. last week, last year) over ‘usual’ reference frames require respondents to focus their recall. Beverage-specific questions and questions that ask respondents to group responses into graduated categories may encourage a more thorough consideration of their alcohol consumption and, in turn, produce more accurate reporting. It is worth considering that the self-report measures themselves are outdated as they focus only upon frequency and volume of alcohol. It may be worthwhile to instead use self-report tests to assess alcohol consumption which take into account symptoms of alcohol addiction/dependence as well. Using review findings, the advantages and disadvantages of each measure type are summarised (Table 3).

Table 3 Summary table of the advantages and disadvantages of the quantity-frequency, graduated-frequency and short-term recall measures

Limitations and strengths

The review found wide variation in the structure, content and format of quantity-frequency, graduated-frequency and short-term recall measures. For example, time-period referents ranged from 24-h recall to alcohol intake over the previous year and alcohol consumption was assessed in terms of units (standardised to the country of each sample of respondents), grams of alcohol, typical sizes of sold drinks and beverage-specific drinks. The included studies from various multidisciplinary databases covered a range of locations, cultures and populations and these factors were taken into account in the analytical comparisons of measures of alcohol consumption. It is important to note that a proportion of the review studies focused on undergraduate student populations (e.g. [8, 10, 34, 40]). Arguably, students may be atypical with respect to the general population [53] and their alcohol consumption patterns may have limited read-across to the general population particularly the population of older people. Some psychometric properties were not assessed including measurement error, cross-cultural validity, internal consistency and responsiveness. All studies were in the English language (in keeping with COSMIN manual guidelines) and it is possible that important studies in other languages may have been missed. The review adhered to the COSMIN manual [15] and whilst the COSMIN method adds rigour to the exercise of psychometric assessment, arguably, a limitation is the use of the ‘worst score counts’ which means that despite attaining higher quality scores on some items, the lowest score of an item list is taken as the overall quality rating (e.g. [28, 31]). Furthermore, studies of poor design quality were included in the review due to the overall lack of studies that met initial eligibility criteria.

Nevertheless, the review was completed in a methodologically robust fashion as per the COSMIN approach which has transparent, tested and validated resources such as a manual, search filters and a quality appraisal tool [15]. Particular strengths include the use of extensive search terms and having two reviewers search the literature.

Conclusion

The studies of quantity-frequency measures indicated good/adequate psychometric properties for test-retest reliability, criterion validity, convergent validity and hypothesis validity; predictive and structural validity were rated as poor and inter-rater reliability reported mixed results. Regarding graduated-frequency measures, good/adequate psychometric properties were reported for test-retest reliability, convergent validity and divergent validity; criterion validity and predictive validity reported mixed results and construct validity and inter-rater reliability were reported as poor. Short-term recall measures achieved good/adequate psychometric properties for test-retest reliability, convergent validity, hypothesis validity, construct validity and divergent validity. Criterion validity, predictive validity and inter-rater reliability reported mixed results. The review findings add to previously published alcohol self-report literature by providing an updated appraisal of measures of alcohol consumption research and indicate that a combination of aspects of the various measures may enhance the reliable and valid assessment of patterns of drinking.

It is difficult to discern which one of the existing measures is the most reliable and valid given the absence of any assessment of certain psychometric properties and the mixed results of studies included in the review. Arguably, when the results from the range of studies are considered and summed, they indicate that the quantity-frequency measure appeared to perform best in psychometric terms compared to the other two measures and, therefore, it is likely to produce the most reliable and valid assessment of alcohol consumption in population surveys. The results indicated that the features of alcohol consumption measures which performed with good reliability and validity were those that assessed beverage-specific alcohol consumption, used actual timeframes and asked about episodes of binge drinking; and that the quantity-frequency measures appeared to be the ‘best’ questionnaire-type currently available to measure self-reported alcohol consumption. Clearly, there is a need for more focused psychometric studies of measures of alcohol consumption including head-to-head comparative population-based and community surveys. Comparability of review results with previous reviews [13, 14] is difficult because they did not employ a COSMIN methodology to appraise studies. Overall, findings appeared to be in keeping with the results of the Gmel review [13], which found that a beverage-specific quantity-frequency measure recorded alcohol consumption more reliably, and with the Feunekes review [14], which reported that the most accurate alcohol intake measurement was provided by quantity-frequency and short-term recall measures.