Background

Physical inactivity is considered to be one of the four leading risk factors for global mortality [1]. The measurement of physical activity is a challenging and complex procedure. Valid and reliable measures of physical activity (PA) are required to: document the frequency, duration and distribution of PA in defined populations; evaluate the prevalence of individuals meeting health recommendations; examine the effect of various intensities of physical activity on specific health parameters; make cross-cultural comparisons and evaluate the effects of interventions [2].

Physical activity questionnaires (PAQs) are often the most feasible method when assessing PA in large-scale studies, likely because of their low cost and convenience, but these instruments have limitations and should be selected and used judiciously. PAQs are prone to measurement error and bias due to misreporting, either deliberate (social desirability bias) or because of cognitive limitations related to recall or comprehension [3, 4]. Cognitive immaturity or degeneration can make self-report of physical activity particularly difficult in the young and elderly [5, 6]. Despite more frequent use of objective assessment methods to measure physical activity, PAQs still provide a practical method for PA assessment in surveillance systems, for risk stratification and when examining etiology of disease in large observational studies. Most PAQs are designed to measure multiple dimensions of PA by reporting type, location, domain and context of the activity, provide estimates of time spent in activities of various levels of intensity, and may be able to rank individuals according to intensity levels of reported activity [7, 8]. However, results from studies aimed at evaluating the validity of PAQs assessed in one population cannot be systematically extrapolated to other populations, ethnic groups, or other geographical regions. Consequently, a great variety of PAQs have been developed and tested for reliability and validity in recent years.

A comprehensive review of PAQs for use in adults was published in 1997 [9]. Since then, reviews summarizing the validity and reliability of PAQs have been carried out in children [10–12] and preschoolers [13]. Recently, specific reviews were published assessing the quality of PAQs available for children [11], adults [14] and the elderly [15]. The aim of the present study was to systematically review the literature on reliability of PAQs as well as their validity evaluated against objective criterion methods, for use in all age groups, published between January 1997 and December 2011, in order to quantitatively compare the performance of existing and newly developed PAQs.

Methods

Inclusion criteria

Studies meeting all of the following inclusion criteria were included: (i) published in the English language between January 1997 and December 2011; (ii) self- or interviewer-administered PAQs or parental proxy reports reporting both reliability and validity results; (iii) PAQs reporting validity results only, when the reliability data had been published previously; (iv) PAQs developed for a healthy general population and for observational surveillance studies; (v) PAQs tested in their original form or in an adapted version if results were reported for validity and reliability, or for validity only when reliability results had been published before; (vi) validity tested against an objective criterion measure of PA (i.e. accelerometry, heart rate, combined heart rate and accelerometry, doubly labeled water (DLW)); (vii) results on validity obtained by pedometer where the questionnaire was specifically developed to assess walking only.

Exclusion criteria

We excluded studies that reported: (i) reliability and validity results in groups with specific clinical or medical conditions (except pregnancy); (ii) results from PAQs that were designed for specific intervention studies; (iii) results where the validity of the PAQ was tested against another self-report method (i.e. diaries, logs); (iv) results on validity using pedometers (except if walking only was tested) and indirect measures of physical activity (e.g. VO2max and body composition); (v) results on essential adaptations of original PAQs, without any published results on both reliability and validity.

Literature search

The PubMed, Medline and Web of Science databases were systematically searched using the following lists and terms:

List A: (physical activity AND health survey OR population survey OR question*)

List B: measure* (i.e. measures, measurement), assess* (i.e. assessment, assessed), self-report, exercise, valid* (i.e. valid, validation, validity), reliab* (i.e. reliable, reliability), reproducible, accelerometer, heart rate, doubly labelled water, doubly labeled water. The search included titles, abstracts, key words and full texts.

Key search terms in List A were combined with each of the terms in List B.
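The pairing of List A with each List B term can be sketched as follows. This is a minimal illustration only: the exact query syntax submitted to each database is not stated, so the `AND` join, the phrase quoting, and the names `LIST_A`, `LIST_B` and `build_queries` are assumptions.

```python
# List A expression, copied from the text above.
LIST_A = "(physical activity AND health survey OR population survey OR question*)"

# List B terms, copied from the text above (both spellings of "labelled").
LIST_B = [
    "measure*", "assess*", "self-report", "exercise", "valid*", "reliab*",
    "reproducible", "accelerometer", "heart rate",
    "doubly labelled water", "doubly labeled water",
]


def build_queries(list_a, list_b):
    """Pair the List A expression with each List B term (assumed AND join)."""
    queries = []
    for term in list_b:
        # Quote multi-word terms so they are searched as phrases (assumption).
        quoted = f'"{term}"' if " " in term else term
        queries.append(f"{list_a} AND {quoted}")
    return queries


QUERIES = build_queries(LIST_A, LIST_B)
```

Each element of `QUERIES` is one candidate search string; the two spellings of doubly labelled water produce two separate queries.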

The literature search was undertaken in two stages. The original literature search (1997–2008) was undertaken by two of the authors (JW, HB) independently and search results were compared and verified. The literature search was then updated to include studies up to December 2011 using exactly the same search criteria (HH). A second search strategy included screening references lists of publications that matched the inclusion criteria and any other publications of which the authors were aware but did not show up during the original literature search. Figure 1 displays an overview of the literature search.

Figure 1 Overview of the literature search.

Data collection and extraction

Data were extracted using a standardized pro-forma which included sample characteristics, questionnaire details, methods of validity and reliability testing, test results and authors’ conclusions. We retrieved full text of articles of all abstracts that met our inclusion criteria. Any queries about the inclusion of papers were resolved by one of the authors (UE).

Reliability

Reliability in all studies was tested through a test-retest procedure to measure consistency of the PAQs. Reliability results from included studies were reported as: intraclass correlation coefficients (ICC); Pearson and Spearman correlation coefficients; and agreement measures using Cohen’s weighted kappa (κ) and mean differences. Reliability was considered poor, moderate (acceptable), or strong when correlation coefficients or kappa statistics were <0.4, 0.4–0.8 or >0.8, respectively [16]. Similarly, an ICC > 0.70 or >0.90 was considered as acceptable and strong, respectively, in those studies reporting this measure [17].
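The thresholds above can be written as a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, boundary values of exactly 0.4 and 0.8 are assigned to the moderate band, and the label "below acceptable" for an ICC of 0.70 or less is an illustrative choice because the text does not name that band.

```python
def classify_correlation(value):
    """Label a correlation coefficient or kappa: <0.4 poor, 0.4-0.8 moderate, >0.8 strong."""
    if value < 0.4:
        return "poor"
    if value <= 0.8:
        return "moderate"
    return "strong"


def classify_icc(value):
    """Label an ICC: >0.90 strong, >0.70 acceptable, otherwise below acceptable."""
    if value > 0.90:
        return "strong"
    if value > 0.70:
        return "acceptable"
    # The text gives no name for this band; the label is an assumption.
    return "below acceptable"
```

Note that the two scales are not interchangeable: a value of 0.75 is moderate on the correlation scale and acceptable on the ICC scale.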

Medians of reliability correlation coefficients across studies were calculated and included in the tables when possible.

Validity

Correlation coefficients were the most commonly used measures of validity, although the Bland-Altman technique [18], which determines absolute agreement between two measures expressed in the same units, was also frequently used. The Bland-Altman method estimates the mean bias and the 95 % limits of agreement (± 2SD of the difference) and is usually plotted as the difference between the methods against the mean of the methods for visual inspection of the error pattern throughout the measurement range; the dependence of the error on the underlying level can be summarised in the error correlation coefficient, but this was seldom reported.
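The Bland-Altman quantities described above can be computed as follows. This is a minimal sketch using the ± 2SD convention given in the text; the function names are hypothetical, and the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev


def bland_altman(questionnaire, criterion):
    """Return (mean bias, lower limit, upper limit) for paired values in the same units."""
    if len(questionnaire) != len(criterion) or len(questionnaire) < 2:
        raise ValueError("need two equal-length series with at least two pairs")
    differences = [q - c for q, c in zip(questionnaire, criterion)]
    bias = mean(differences)
    sd = stdev(differences)  # sample SD of the differences (assumption)
    return bias, bias - 2 * sd, bias + 2 * sd


def plot_points(questionnaire, criterion):
    """Return (mean of the two methods, difference) pairs for a Bland-Altman plot."""
    return [((q + c) / 2, q - c) for q, c in zip(questionnaire, criterion)]
```

Plotting the pairs from `plot_points` with horizontal lines at the three values from `bland_altman` gives the usual visual check of whether the error changes across the measurement range.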

Medians of included validity correlation coefficients were calculated and included in the tables when possible. When calculating the medians, we excluded those studies reporting correlation coefficients for the associations of self-reported sedentary time. The medians for sedentary time are reported separately and associations of sedentary time with measures of total physical activity (i.e. total energy expenditure [TEE], physical activity level [PAL] and total activity from accelerometry [mean counts]) from the criterion method were excluded in these analyses as these measures are expected to be inversely related.
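The median calculation with sedentary-time results handled separately can be sketched as below. The record layout, the function name and the example values are all hypothetical and are not data from this review.

```python
from collections import defaultdict
from statistics import median


def medians_by_statistic(results, sedentary=False):
    """Median coefficient per statistic type, for sedentary or non-sedentary results."""
    grouped = defaultdict(list)
    for row in results:
        if row["sedentary"] == sedentary:
            grouped[row["statistic"]].append(row["coefficient"])
    return {stat: median(values) for stat, values in grouped.items()}


# Hypothetical extracted results, for illustration only (not data from the review).
EXAMPLE_RESULTS = [
    {"statistic": "spearman", "coefficient": 0.20, "sedentary": False},
    {"statistic": "spearman", "coefficient": 0.30, "sedentary": False},
    {"statistic": "spearman", "coefficient": 0.45, "sedentary": False},
    {"statistic": "pearson", "coefficient": 0.25, "sedentary": False},
    {"statistic": "pearson", "coefficient": 0.50, "sedentary": False},
    {"statistic": "spearman", "coefficient": 0.10, "sedentary": True},
]
```

Calling `medians_by_statistic(EXAMPLE_RESULTS)` gives the physical activity medians, and passing `sedentary=True` gives the separately reported sedentary medians. With an even number of coefficients the median is the mean of the two middle values, which is why a median can carry three decimals.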

Classification

Questionnaires were classified as new or existing (i.e. with previously published test results) PAQs. Existing questionnaires were subdivided into those which reported new reliability and validity results, and those which reported new results on validity only but had previously reported results on reliability. A questionnaire was classified as new when the study in question was the first to publish reliability and objective validity data on the PAQ. Thereafter, studies were further stratified by age group of the sample. Study populations with a mean age below 18 years were categorised as youth, those with a mean age of 18–65 years as adults, and those with a mean age above 65 years as elderly.
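The age stratification can be expressed as a simple rule. This is a sketch only: the function name is hypothetical, and mean ages of exactly 18 and 65 years are assigned to the adult group on the assumption that the 18–65 range is inclusive.

```python
def age_group(mean_age):
    """Classify a study population by the mean age of its sample (in years)."""
    if mean_age < 18:
        return "youth"
    if mean_age <= 65:
        return "adults"
    return "elderly"
```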

PAQs included

PAQ abbreviations are listed in Table 1, with their respective timeframe. The details of these studies are shown in Tables 2 (new PAQs) and 5 (existing PAQs). A range of tests were used to assess reliability and validity with some studies reporting results for a total questionnaire summary score, and others assessing reliability and validity for various aspects, intensities, or domains of the questionnaire and/or by subgroups within the test population. The total score or index for the PAQ was reported, if available. In the absence of a total score, correlation coefficients by intensity category or group are reported. Where multiple results were reported, a decision was made about the data that constituted the main results based on the stated objectives for the study or questionnaire. Several studies compared results to another questionnaire concurrently but if this was a secondary aim of the specific study, the results were not included.

Table 1 List of questionnaire abbreviations and the corresponding definitions
Table 2 Descriptive characteristics of new PAQs

Results were reported for both total score and other aspects (e.g. domain, intensity) when this substantially added to the information for the specific study, for example when total PA was tested against a different validation method than PA intensities [31]. Some questionnaires assessed sedentary behaviour and these results are specifically reported in the tables or text. Sedentary behaviour has recently been suggested to be considered distinctively from physical activity in associations with health outcomes [50].

Results

The search string (JW and HH) resulted in a total of 11098 hits. The first literature search resulted in 125 papers being retrieved for data extraction. The update of the literature review to December 2011 resulted in a further 75 papers being retrieved for data extraction (Figure 1). More than half of the papers retrieved were excluded (n = 104). The main reasons for exclusion were inappropriate criterion measures, generally a measure of aerobic fitness (n = 48), and lack of information on reliability (n = 26) or validity (n = 17) (Figure 1).

New PAQs

The description of newly developed PAQs is summarized in Table 2. The literature search found 31 articles, reporting results from 34 newly developed PAQs of which 10 were from the United States, 10 from Europe, six from Australia, two from Canada, and one study each from Japan and Sub-Saharan Africa. Of note was a 12-country international study testing the International Physical Activity Questionnaire (IPAQ) [34]. This questionnaire is available in a short form for surveillance and in a longer form when more detailed physical activity information is collected. Both forms are available in a number of languages. IPAQ has been rigorously tested for reliability and validity and this has been replicated in a number of countries.

Nineteen studies tested the reliability and validity in adults, an additional 11 studies focused on youth [19–29] and one study was performed in Japanese elderly [49]. Most studies (n = 25) included men and women, four studies [26, 30, 32, 35] reported data in women only and two studies [37, 38] in men only. The number of participants varied from 30 to 2271, and several studies [19, 20, 29, 31, 33–35, 39–41, 43–47] performed reliability testing in a larger sample than their test of criterion validity. The most common response timeframe was the last seven days, with seven studies [27, 30, 36, 37, 44, 46, 47] using a timeframe covering the last year (Table 1). All PAQs captured some elements of leisure time and recreational activity, although most questionnaires also addressed multiple domains of activity. Sedentary time is also commonly captured by the newly developed questionnaires and has been given some extra attention in recent publications and in the current results. Several recent PAQs, such as the EPIC Physical Activity Questionnaire (EPAQ2) and the Recent Physical Activity Questionnaire (RPAQ), aim to measure the totality of physical activity by domains [31, 46, 47, 51]. The final outcome of the majority of PAQs was reported as time-integrated MET values, e.g. MET-min/week.

Reliability

All reliability results for new PAQs are listed in Table 3.

Table 3 Reliability results of new PAQs

Reliability was usually reported as ICC (n = 13), Pearson/Spearman correlation (n = 6), kappa statistic (n = 3) or a combination of these statistics (n = 9). Higher reliability coefficients were more often seen in association with shorter periods between test and retest. Poor correlation (ICC or r <0.4) was found only in subcategories of a few PAQs. Median correlations from reported data for recall of sedentary behaviours across all PAQs were acceptable: ICC = 0.68, Spearman r = 0.60, Pearson r = 0.475, kappa = 0.66.

Youth

Median reliability correlations for the youth were as follows: ICC = 0.69, Spearman r = 0.71, Pearson r = 0.80, kappa = 0.53. The Activitygram (ICC = 0.24) [26] and the self-reported CLASS questionnaire (frequency: ICC = 0.36, duration ICC = 0.24) [25] showed fairly low reliability correlations, whereas the MARCA (ICC = 0.93) [52] and both computer and paper versions of the CDPAQ (ICC = 0.91–0.98) [23] demonstrated high reliability.

Adults

Median reliability correlations for adults were as follows: ICC = 0.765, Spearman r = 0.75, Pearson r = 0.74, kappa = 0.655. Reliability was poor for the AQuAA score for adults (ICC = 0.22) [53]. Similarly, reliability coefficients were poor for the HUNT2 [37] components of light (r = 0.17, κ = 0.20) and hard activity (r = 0.17, κ = 0.41). The primary version of this questionnaire (HUNT1), which was designed a decade earlier, however, demonstrated high reliability (r = 0.76–0.87, κ = 0.69–0.82) [54]. The majority of the questionnaires showed acceptable to good reliability: KPAS (ICC = 0.82–0.83) [30], RPAQ (ICC = 0.76) [31], PPAQ (ICC = 0.78) [32], IPAQ short (r = 0.76) and long version (r = 0.81) [34], AWAS (ICC = 0.73–0.80) [35], FPACQ (ICC = 0.68–0.80) [22], OPAQ (ICC = 0.78) [42], SBQ (ICC = 0.77–0.85, r = 0.74–0.79) [43], SPAQ (r = 0.998) [39] and SSAAQ (r = 0.95) [44].

Elderly

Median Pearson reliability correlation for the elderly was r = 0.70. The PAQ-EJ was the only new PAQ designed for (Japanese) elderly that reported reliability results and has acceptable recall properties (r = 0.70) [49].

Validity

All validity results for new PAQs are listed in Table 4.

Table 4 Validity results of new PAQs

Accelerometry and in particular the ActiGraph accelerometer was the most commonly used criterion method (n = 19), followed by the Caltrac accelerometer (n = 4) and the Polar heart rate monitor (n = 4). DLW was used in one study, where absolute validity was moderate to high for PAEE (r = 0.39) and TEE (r = 0.67) [31]. In general, validity coefficients were considerably lower than reliability coefficients. Median correlations across all PAQs between reported sedentary behaviours and calculated inactivity from objective measures were low: Spearman r = 0.12.

Youth

Median validity correlations for the youth were as follows: Spearman r = 0.22, Pearson r = 0.41. CLASS self- and parental reported physical activity (r = −0.04–0.11) [25] was among the least valid questionnaires for children, although several other PAQs also showed low correlations with objective measures: Pre-PAQ (r = −0.07–0.17) [19], BONES PAS (r = 0.23–0.27) [20], GAQ (r = 0.27–0.29) [26], Fels PAQ (0.11–0.34) [27]. None of the newly developed PAQs for children demonstrated high validity.

Adults

Median validity correlations for adults were as follows: Spearman r = 0.27, Pearson r = 0.28. Highest validity in adults was demonstrated for the SSAAQ when tested against the Caltrac accelerometer (r = 0.60–0.74) [44]. Low validity correlations for total activity or for all subcategories were observed for the HUNT1 (r = 0.03–0.07) [54], and the short EPIC PAQ (r = 0.04), although the main outcome, a four-category physical activity index derived from this instrument, was significantly associated with objectively measured physical activity energy expenditure (p for trend = 0.003) [47]. A follow-up study in 1941 adults from 10 European countries suggested moderate validity (r = 0.33) of this instrument using physical activity energy expenditure from combined heart rate and movement sensing as the criterion [51].

Rosenberg et al. assessed the validity of sedentary behaviour only, and demonstrated low correlations (partial r = −0.01–0.10) with objectively measured sedentary time (<100 counts/min) by the ActiGraph accelerometer [43].

Elderly

Median Spearman validity correlation for the elderly was r = 0.41. The PAQ-EJ was tested by correlating a total score with MET-min/day calculated from the Kenz Lifecorder accelerometer-based pedometer (r = 0.41) [49].

Existing PAQs

New validity and reliability results for existing PAQs were reported in 35 studies, and 30 studies reported new results on validity only (Table 5). One study is classified as a study testing an existing PAQ, but also reports both validity and reliability data for a new PAQ (SP2PAQ) [55]. Twenty-six of the 65 studies were undertaken in the US with the remaining coming from Australia (n = 5), Sweden (n = 5), China (n = 4), Belgium (n = 3), Spain (n = 3), Canada (n = 2), France (n = 2), Norway (n = 2), Japan (n = 2), Brazil, Portugal, Singapore, South Africa, Turkey, United Kingdom and Vietnam. There were four multi-country studies; three testing the IPAQ modified for adolescents [56, 57] and the EPAQ-s in 9–10 European cities [51]. The GPAQ was tested in a diverse sample from nine countries [58]. Eighteen studies were undertaken in youth [57, 59–74], 12 in elderly [75–86], and 35 in adults, with a few studies including both older adolescents and adults. In 48 studies men and women were combined, 10 studies examined women only [70, 72, 87–93], and seven studies included only men [54, 75, 78, 94–97]. All authors concluded that the questionnaires had shown at least satisfactory results for reliability and validity (see results below); seven studies noted considerable limitations in aspects of their questionnaires [56, 59, 63, 90, 98–100].

Table 5 Descriptive characteristics of existing PAQs

Reliability

All reliability results for existing PAQs are listed in Table 6.

Table 6 Reliability results of existing PAQs

Most studies examining the reliability of existing PAQs reported reliability as ICC (n = 20), Pearson/Spearman correlation coefficients (n = 8); some studies also used a combination of correlation statistics (n = 7). Similar to the new PAQs, the existing PAQs demonstrated moderate correlations for reliability. Median correlations from reported data for recall of sedentary behaviours were divergent: ICC = 0.76, Spearman r = 0.725, Pearson r = 0.305, kappa = 0.645.

Youth

Median reliability correlations for the youth were as follows: ICC = 0.64, Pearson r = 0.605. The CHASE (ICC = 0.02) and the CPAQ (ICC = 0.25) showed poor test-retest reliability, whereas the reliability was strong for YPAQ (ICC = 0.79–0.86) in the same study [61]. Previous day physical activity recall instruments proved to be highly reliable in children (ICC = 0.98 [60], r = 0.98 [74]).

Adults

Median reliability correlations for adults were as follows: ICC = 0.79, Spearman r = 0.64, kappa = 0.655. The IPAQ-SALVCF (ICC = 0.929) [105], IPAQ long version (r = 0.87–0.90 [108], ICC = 0.93 [110]), IPAQ short version (ICC = 0.79) [99], FPACQ (ICC = 0.77–0.96) [111], KPAS-mod (ICC = 0.76–0.84) [92] and the JPAC (ICC = 0.99) [113] showed acceptable or strong reliability. Notably, the IPAQ-s showed a wide range of results for reliability, with ICCs of 0.27–0.97 for sitting [54, 69, 83, 85, 99, 103, 112], 0.10–0.42 for walking [54, 69], 0.30–0.34 for MPA [54, 69], 0.30–0.62 for VPA [54, 69], and 0.33–0.79 for total PA [83, 85, 99, 103, 112]. For sedentary time the short IPAQ appeared to be the most reliable questionnaire when the test-retest interval was short (i.e. 3 days, [ICC = 0.97]) [99]. Overall, all existing PAQs for adults reported acceptable to high reliability properties.

Elderly

Median reliability correlations for the elderly were as follows: ICC = 0.65, Spearman r = 0.60, Pearson r = 0.62. Similarly, all existing PAQs for elderly also showed overall acceptable to high reliability, with the PASE (ICC = 0.91) [77], 7DPAR (ICC = 0.89) [78] and CHAMPS-MMSCV (ICC = 0.81–0.89) [79] performing best.

Validity

All validity results for existing PAQs are listed in Table 7.

Table 7 Validity results of existing PAQs

Of the 65 studies that report new results for the validity of existing questionnaires, 14 studies [55, 61, 69, 75, 81, 83, 84, 87, 89, 91, 94, 96, 97, 103] tested two or more questionnaires. Forty-five studies used accelerometry as the criterion, and the remaining used DLW (n = 8) [71, 75, 84, 89, 93, 94, 96, 116], pedometry (n = 3) [79, 101, 105], HR monitoring (n = 1) [104], MiniLogger (n = 1) [81] or a combination of methods (n = 5) [51, 60, 61, 74, 114]. Spearman and Pearson correlations were the most commonly used statistical measures for assessing validity; four studies reported 95 % confidence intervals with these correlations [51, 102, 103, 112] and three studies solely reported results using the Bland-Altman levels of agreement method [84, 94, 104]. Median correlations between reported sedentary behaviours and inactivity from objective measures were calculated: Spearman r = 0.23, Pearson r = 0.435.

Youth

Median validity correlations for the youth were as follows: Spearman r = 0.25, Pearson r = 0.38. Many PAQs (SAPAC [59], HBSC [54], IPAQ-s [54], GSQ [70] and GAQ [118]) demonstrated low validity coefficients (r < 0.2) in youth and only one instrument (PDPAR [60]) was regarded as highly valid (r = 0.76) when compared with physical activity assessed by the Caltrac accelerometer.

Adults

Median validity correlations for adults were as follows: Spearman r = 0.30, Pearson r = 0.46. Validity correlations were generally low for most PAQs, except for the FPACQ [111] compared with accelerometry in multiple subcategories (r = 0.39–0.85) and the BAQ (r = 0.68–0.69), FCPQ (r = 0.34–0.61) and TCQ (r = 0.63–0.64) for estimated TEE compared with TEE measured with the DLW method [96]. Pettee-Gabriel et al. compared five different PAQs with accelerometry from the ActiGraph accelerometer and showed acceptable validity for all instruments: PMMAQ (r = 0.59–0.60), PWMAQ (r = 0.56–0.60), NHS-PAQ (r = 0.42–0.46), AAS (r = 0.46–0.50), WHI-PAQ (r = 0.45–0.47) [91]. Several studies, including those of the 7DR-O [87], MAQ [109], CAPS [89], IPAQ [55, 90] and the IPAQ-s [54, 98, 99], demonstrated poor validity.

Elderly

Median validity correlations for the elderly were as follows: Spearman r = 0.40, Pearson r = 0.345. Bonnefoy et al. tested the validity of 10 previously developed, well-known PAQs using DLW as the criterion measure [75]. The results of this study suggested that the Stanford Usual Activity questionnaire performed best (r = 0.63–0.65). Other studies in elderly generally found low correlations between self-reported PA and objective measures, as also demonstrated by the generally weak performance of the YPAS in several studies (r = 0.11–0.61) [75, 76, 81, 83, 84], and of the PASE in one of the studies (r = 0.16–0.17) [80].

Discussion

This systematic review covered the most recent 15-year period. We identified 31 studies that adequately tested newly developed PAQs for both validity and reliability during this period. This suggests that, whilst objective monitoring of physical activity has become widespread, including for examining population levels of activity [119–121], PAQs remain an active area of research and are now generally considered complementary to any objective measure. Several previous reviews have assessed the reliability and validity of PAQs with a special focus on their overall performance [9], or performance in specific age groups [11, 14, 15]. In contrast, we examined whether newly developed PAQs performed better than older PAQs, as this will inform researchers and practitioners when choosing an existing PAQ or developing a new instrument for assessing physical activity. We therefore comprehensively summarized the results to allow an adequate appraisal of the performance of existing PAQs across domains and physical activity intensities.

In concordance with previous reviews [11, 14, 15], very few questionnaires showed acceptable reliability and validity across age groups. Developing new PAQs requires careful consideration of the study design in terms of target population, sample size, age group, recall period, dimension and intensity of PA, relative and absolute validity, standardized quality criteria and appropriate comparison measures. The lack of formulating a priori hypotheses was recently highlighted as a limitation in most studies examining the validity of PAQs [11] and comprehensive key criteria for physical activity and sedentary behaviour validation studies have been proposed [122, 123].

Since the comprehensive review by Kriska and Caspersen [9], it is apparent that more appropriate criterion methods, in particular accelerometry, have been used to test the validity of PAQs. Yet, a considerable number of studies were excluded from the present review due to an inappropriate criterion method (e.g. aerobic fitness). Many studies reported reliability and validity results for existing and well established questionnaires, which suggests that these instruments are still frequently used. Importantly, newly developed PAQs do not seem to perform any better than existing instruments in terms of reliability and validity. Unfortunately, we were not able to conduct a formal meta-analysis due to differences in reported outcomes, different criterion measures and different time frames between questionnaires.

Total energy expenditure (TEE) was frequently used as the outcome measure of the PAQ and the validity scores from these types of instruments are usually high. However, the results from many of these studies should be interpreted carefully. This is because TEE from any self-report incorporates an estimate of resting energy expenditure (REE) generally calculated from body weight, sex and age. REE explains most of the variation in TEE and, consequently, high correlations may be generated when comparing TEE from self-report with measured or estimated TEE from the criterion method. This is particularly problematic when those same predictions of REE are used by both the criterion method and the self-reported calculation of energy expenditure. Therefore, other outputs (e.g. time spent in different intensity levels, physical activity energy expenditure normalised for body size) from the criterion method appear more appropriate to serve as criterion measures. In these studies correlations between the criterion measure and self-reported PA are considerably weaker than those for TEE, although the PAQs concerned may still be considered valid as demonstrated in some studies [31, 116]. The notion of validity, however, is a matter of degree, rather than an all-or-nothing determination.
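The inflation of TEE correlations by a shared REE component can be illustrated with a small simulation. This is a sketch under loudly stated assumptions: all means and standard deviations (in kcal/day) are hypothetical, REE is taken to be identical in the self-report and criterion calculations, reporting error is independent Gaussian noise, and the function names are illustrative.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=2000, seed=1):
    """Return (r for PAEE, r for TEE) between self-report and criterion."""
    rng = random.Random(seed)
    # Hypothetical population values, kcal/day (assumptions, not review data).
    ree = [rng.gauss(1600, 250) for _ in range(n)]
    paee_true = [rng.gauss(600, 200) for _ in range(n)]
    # Self-reported PAEE carries large random reporting error.
    paee_reported = [p + rng.gauss(0, 400) for p in paee_true]
    # The same REE enters both totals, as when one prediction equation is shared.
    tee_criterion = [r + p for r, p in zip(ree, paee_true)]
    tee_reported = [r + p for r, p in zip(ree, paee_reported)]
    return pearson(paee_reported, paee_true), pearson(tee_reported, tee_criterion)
```

Under these assumed variances the population correlation is about 0.45 for PAEE and about 0.62 for TEE, even though the questionnaire error is the same in both comparisons; the difference comes entirely from the shared REE term.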

The validity correlation coefficients from the vast majority of existing and newly developed PAQs were considered poor to moderate and usually only acceptable when results were presented as Pearson or Spearman correlation coefficients. This suggests that most PAQs may be valid for ranking individuals’ behaviour whereas their absolute validity for quantifying PA is limited. Although our summary of the correlations in a single median value should be interpreted with caution, we did not observe any substantial difference between new and existing PAQs. This may suggest that, despite considerable effort, accurate and precise self-report physical activity instruments are still scarce [124]. Many of the newly developed instruments collected information in various domains of physical activity including transportation and housework. Despite this, it appears almost impossible to obtain a valid estimation of a highly variable behaviour such as free-living physical activity by self-report. While results from large scale observational cohort studies have convincingly demonstrated the beneficial effects of self-reported physical activity on various health outcomes including all-cause mortality, coronary and cardiovascular disease morbidity and mortality, some types of cancer, and type 2 diabetes, the detailed dose–response associations are still unknown [125]. Increased sample size is usually considered to improve precision but may not overcome issues about accuracy. Further, a large sample size does not overcome misclassification due to differential measurement error. Therefore, future studies should consider including an objective measure of physical activity in addition to self-report or consider recommendations to reduce self-report error [126].

With few exceptions, most PAQs reviewed showed acceptable to good reliability with only minor differences between existing and newly developed PAQs. The median reliability correlations were acceptable to good in youth (0.64 – 0.65), adults (0.64 – 0.79), and the elderly (0.60 – 0.65) for existing PAQs; and marginally higher for newly developed PAQs in youth (0.69 – 0.80), adults (0.74 – 0.765), and the elderly (0.70). However, only 3 of 11 newly developed PAQs [21, 23, 24] showed consistently good reliability.

For existing PAQs, median validity correlations were poor to acceptable in youth (0.25 – 0.38), adults (0.30 – 0.46), and elderly (0.345 – 0.40); and essentially similar for newly developed PAQs in youth (0.22 – 0.41), adults (0.27 – 0.28), and the elderly (0.41).

Only four of the reviewed questionnaires, the IPAQ-s (existing) [85], the FPACQ (existing) [111], PDPAR (existing) [60] and the RPAR (new) [21], showed acceptable to good results for both reliability and validity. Sedentary behaviour appeared to be one of the most difficult domains to assess with questionnaires, as demonstrated by the poor correlations with objectively measured sedentary time, although arguably there are also limitations of the criterion measures, which contribute to poorer agreement between methods. About one third (n = 11) of the studies reporting data on newly developed PAQs assessed both validity and reliability for sedentary behaviour. For existing PAQs, 17 studies reported data on validity and 15 on reliability for sedentary behaviour.

Accuracy of PA recall may be increased at the retest administration by an increased physical activity awareness as a result of completing the questionnaire previously [105]. Many of the reviewed studies did not specify details about their reliability testing, making it difficult to distinguish test-retest reliability of the instrument from a measure of stability of physical activity. It is therefore difficult to attribute the correlations to either the reliability of the instrument or the stability of the behaviour of the participant. Assessing test-retest reliability for a last seven day PAQ is generally more straightforward than for a PAQ assessing usual or last year physical activity. This is because when examining the reliability of a last seven days instrument the respondents should be prompted to report their PA during exactly the same week on two different occasions separated in time. However, this must be weighed against administering the test and retest so close in time that the respondent remembers the answers given at the first administration, resulting in inflation of reliability estimates from correlated error. Several study details other than the recall timeframe can have a marked influence on the study results, such as socio-cultural background, sex, age, literacy, and cognitive abilities.

The DLW method is usually considered the most accurate criterion method available for measuring TEE and PAEE. However, as discussed above, when the DLW method or other objective methods that provide outputs as TEE are used as the criterion instrument, individual variability in body weight needs to be considered. It is therefore recommended that, in subsequent validation studies, data from these methods be expressed as PAEE, both with and without normalisation for body weight. Combined heart rate and movement sensing may be more accurate than either method used alone for measuring time spent at different intensity levels [31]. However, most of the newly developed PAQs used a single accelerometer mounted at the hip as the criterion method, possibly because of its reasonable cost and feasibility in large study groups. Accelerometry also has some inherent limitations, including its inability to accurately assess the intensity of specific types of activity such as weight-bearing activities, cycling, and swimming [33]. Further, the somewhat arbitrary choice of cut-off points [127–129] to classify intensities of activity when using accelerometry as a criterion method has been documented before. Using accelerometers to validate PAQ-derived time spent in different intensities of physical activity is especially problematic, and this also hampers comparison of studies [33]. Criterion measures usually assess overall PA (e.g. time in MVPA, PAEE), which precludes a direct test of the validity of self-reported domain-specific activity (e.g. occupation). It is therefore not surprising that some PAQs [e.g. 86] which only assess a specific domain of activity demonstrate low validity when compared with overall physical activity from the criterion instrument. More research is therefore needed to compare time-stamped criterion data with domain-specific self-reported activity and to develop criterion instruments that can accurately categorise types of activities.
Adopting a conceptual framework for physical activity [130] in combination with standardized procedures when developing and validating PAQs [122, 123] is highly recommended.

Pearson and Spearman correlations may not be the most appropriate statistical methods for reporting results on the validity of PAQs. ICC is considered a more appropriate method for continuous measures on the same scale, whereas weighted kappa is a better choice for categorical measures [131, 132]. When reporting validation results, researchers are encouraged to report absolute validity in terms of mean bias with limits of agreement, as well as the error structure of the instrument across the measurement range. We noted that many of the newly developed instruments reported results on absolute validity by means of the Bland-Altman method, which is a simple, intuitive and easily interpreted way to assess measurement error [133]. Descriptive details of the study population may help to explain any heterogeneity in the findings from different studies. Researchers can individually interpret all data for quality and applicability.
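
As an illustration of the reporting recommended here, the following minimal Python sketch computes the mean bias, the 95% limits of agreement and a simple indicator of the error structure (the slope of the differences regressed on the pair means) for paired questionnaire and criterion values. The function name, the variable names and the data are hypothetical and are not taken from any of the reviewed studies. The limits are calculated as bias ± 1.96 SD of the differences, which assumes approximately normally distributed differences.

```python
from statistics import mean, stdev


def bland_altman(questionnaire, criterion):
    """Summarise agreement between paired questionnaire and criterion values.

    Returns the mean bias (questionnaire minus criterion), the 95% limits of
    agreement (bias +/- 1.96 SD of the differences) and the least-squares
    slope of the differences on the pair means, a rough check for error that
    changes across the measurement range.
    """
    if len(questionnaire) != len(criterion) or len(questionnaire) < 3:
        raise ValueError("need two equal-length series with at least 3 pairs")

    diffs = [q - c for q, c in zip(questionnaire, criterion)]
    pair_means = [(q + c) / 2 for q, c in zip(questionnaire, criterion)]

    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences

    # Least-squares slope of differences on pair means
    m_bar = mean(pair_means)
    sxx = sum((m - m_bar) ** 2 for m in pair_means)
    sxy = sum((m - m_bar) * (d - bias) for m, d in zip(pair_means, diffs))
    slope = sxy / sxx if sxx else float("nan")

    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
        "slope": slope,
    }


# Hypothetical example: daily MVPA in minutes from a PAQ and an accelerometer
paq_mvpa = [45, 60, 30, 90, 75, 20]
acc_mvpa = [35, 50, 28, 70, 55, 22]

result = bland_altman(paq_mvpa, acc_mvpa)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

A positive bias in this sketch would indicate over-reporting relative to the criterion, and a slope clearly different from zero would suggest that the error depends on the level of activity. In that case a single pair of limits of agreement may not describe the instrument adequately across its range.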

In summary, we systematically reviewed studies assessing both reliability and validity of PAQs in various domains, across age groups, and with a focus on total PA and sedentary time. PAQs are inherently subject to many limitations and the choice of PAQs should be dictated by the research question and the population under study. Considerations for researchers when using PAQs in practice have been identified and new research should consider including an objective method for assessing physical activity in addition to any self-report [134]. This review has identified a limited number of PAQs that appear to have both acceptable reliability and validity. Newly developed PAQs do not appear to perform substantially better than existing PAQs in terms of reliability and validity.