Key Points

No conclusive evidence was found for both the validity and the reliability of any of the included physical activity questionnaires for youth.

High-quality studies on the measurement properties of the most promising physical activity questionnaires are urgently needed, e.g., by using the COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist.

More attention to the content validity of physical activity questionnaires is needed to confirm that they measure what they intend to measure.

1 Introduction

Numerous studies have demonstrated beneficial effects of physical activity, in particular of moderate to vigorous intensity, on metabolic syndrome, bone strength, physical fitness, and mental health in children and adolescents [1, 2]. In order to monitor trends in physical activity, examine associations between physical activity and health outcomes, and evaluate the effectiveness of physical activity-enhancing interventions, valid, reliable, responsive, and feasible measures of physical activity are needed.

Accelerometers are considered to provide valid and reliable measures of physical activity in children and adolescents [3]. However, accelerometers are not a gold standard and underestimate activities such as cycling, swimming, weight lifting, and many household chores. Moreover, physical activity estimates vary depending on subjective decisions in data reduction, such as the choice of cut-points for intensity levels, the minimum number of valid days, the minimum number of valid hours per day, and the definition of non-wear time [4]. Furthermore, accelerometers cannot provide information on the type and context of the behavior and are labor-intensive and costly, especially in large populations [5].
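As a simple illustration of how such data-reduction decisions can shift estimates, the sketch below (Python, with hypothetical epoch counts and illustrative cut-point values that are not taken from any of the included studies) classifies the same accelerometer epochs under two different intensity cut-points and arrives at different MVPA totals.

```python
# Minimal sketch (hypothetical values): the same accelerometer counts can yield
# different MVPA estimates depending on the chosen intensity cut-point.
epoch_counts = [150, 2400, 3100, 900, 4200, 2600, 500, 3800]  # counts per 60-s epoch

def mvpa_minutes(counts, cut_point):
    """Minutes classified as moderate to vigorous activity (1 epoch = 1 min)."""
    return sum(1 for c in counts if c >= cut_point)

# The two cut-points below are illustrative only, not endorsed thresholds.
print(mvpa_minutes(epoch_counts, 2296))  # -> 5 minutes of MVPA
print(mvpa_minutes(epoch_counts, 3200))  # -> 2 minutes of MVPA
```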

Self-report or proxy-report questionnaires are seen as a convenient and affordable way to assess physical activity that can provide information on the context and type of the activity [5, 6]. However, questionnaires have their limitations as well, such as the potential for social desirability and recall bias [6, 7]. Thus, a combination of more objective measures, such as accelerometers, and self-report questionnaires seems most promising for measuring physical activity.

A great many questionnaires measuring physical activity in children and adolescents have been developed, with varying formats, recall periods, and types of physical activity recalled. To be able to select the most appropriate questionnaire, an overview of the measurement properties of the available physical activity questionnaires in children and adolescents is highly warranted. In 2010, Chinapaw et al. [8] reviewed the measurement properties of self-report and proxy-report measures of physical activity in children and adolescents. As many studies assessing measurement properties of physical activity questionnaires have been published since then, an update is timely.

Therefore, we aimed to summarize studies that assessed the measurement properties (e.g., responsiveness, reliability, measurement error, and validity) of self-report or proxy-report questionnaires in children and adolescents under the age of 18 years published since May 2009. Furthermore, we aimed to provide recommendations regarding the best available questionnaires, taking into account the best available questionnaires from the previous review.

2 Methods

This review is an update of the previously published review of Chinapaw et al. [8]. We followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) reporting guidelines and registered the review on PROSPERO (international prospective register of systematic reviews; registration number: CRD42016038695).

2.1 Literature Search

Systematic literature searches were carried out in PubMed, EMBASE, and SPORTDiscus (from January 2009 up until April 2018). In PubMed more overlap in time was maintained (search from May 2008), as our previous searches showed that the PubMed time filter can be inaccurate, e.g., due to incorrect labeling of publication dates. The full search strategy can be found in the Electronic Supplementary Material (Online Resource 1).

Search terms in PubMed were used in AND-combination and related to physical activity (e.g., motor activity, exercise), children and adolescents (e.g., schoolchildren, adolescents), measurement properties (e.g., reliability, reproducibility, validity) [9], and self- or proxy-report measures (e.g., child-reported questionnaire). Medical Subject Headings (MeSH), title and abstract (TIAB), and free-text search terms were used, and a variety of publication types (e.g., biography, comment, case reports, editorial) were excluded. In EMBASE, search terms related to physical activity, measurement properties [9], and self- or proxy-report measures were used in AND-combination. The search was limited to children and adolescents (e.g., child, adolescent) and to records indexed only in EMBASE (EMBASE-only). EMBASE subject headings, TIAB, and free-text search terms were used. In SPORTDiscus, TIAB and free-text search terms were used in AND-combination, related to physical activity, children and adolescents, and self- or proxy-report measures.

2.2 Inclusion and Exclusion Criteria

Studies were eligible for inclusion when (1) the aim of the study was to evaluate at least one of the measurement properties of a self-report or proxy-report physical activity questionnaire, or a questionnaire containing physical activity items; (2) the questionnaire under study at least reported data on the duration or frequency of physical activity; (3) the mean age of the study population was < 18 years; and (4) the study was available in the English language. Studies were excluded in the following situations: (1) studies assessing physical activity using self-report measures administered by an interview (one-on-one assessment) or using a diary; (2) studies evaluating the measurement properties in a specific population (e.g., children who are affected by overweight or obesity); (3) studies examining structural validity and/or internal consistency for questionnaires that represent a formative measurement model; (4) construct validity studies examining the relationship between the questionnaire and a non-physical activity measure, e.g., body mass index (BMI) or percentage body fat; and (5) responsiveness studies that did not use a physical activity comparison measure, e.g., accelerometer, to assess a questionnaire’s ability to detect change.

2.3 Selection Procedures

Titles and abstracts were screened for eligible studies by two independent researchers [Lisan Hidding (LH) and either Mai Chinapaw (MC), Mireille van Poppel (MP), Teatske Altenburg (TA), or Lidwine Mokkink (LM)]. Subsequently, full texts were obtained and screened for eligibility by two independent researchers (LH and either TA or MP). A fourth researcher (MC) was consulted in the case of doubt.

2.4 Data Extraction

For all eligible studies, two independent reviewers (LH and either TA or MP) extracted data regarding the characteristics of studies and results of the assessed measurement properties, using a structured form. Extracted data regarding the methods and results of the assessed measurement properties included study population, questionnaire under study, studied measurement properties, comparison measures, time interval, statistical methods used, and results regarding the studied measurement properties. In the case of disagreement regarding data extraction, a fourth researcher (MC) was consulted.

2.5 Methodological Quality Assessment

Two independent reviewers (LH and either MC or LM) rated the methodological quality of the included studies using the standardized COnsensus-based Standards for the selection of health Measurement INstruments (COSMIN) checklist [10,11,12]. For each measurement property, the design requirements were rated using a 4-point scale (i.e., excellent, good, fair, or poor). The lowest score counts method was applied, i.e., the final methodological quality was rated as poor if any single item was scored as poor. The lowest rated items that determined the final score for each study are shown in Electronic Supplementary Material Online Resource 2. The methodological quality of the content validity studies was not assessed, as often little or no information was available on the development of the questionnaire or on the assessment of the relevance, comprehensiveness, and comprehensibility of items. One minor adaptation to the original COSMIN checklist, also described in a previous review [13], was applied: Percentage of Agreement (PoA) was removed from the reliability box and added to the measurement error box as an excellent statistical method [14]. To assess the methodological quality of test–retest reliability studies, the standards previously described by Chinapaw et al. [8] regarding the time interval were applied: between > 1 day and < 3 months for questionnaires recalling a standard week; between > 1 day and < 2 weeks for questionnaires recalling the previous week; and between > 1 day and < 1 week for questionnaires recalling the previous day.
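The lowest score counts principle can be expressed compactly; the following minimal sketch (Python, with hypothetical item ratings) only illustrates how a single poorly scored design item determines a study's overall methodological quality rating.

```python
# Minimal sketch of the "lowest score counts" principle: a study's overall
# methodological quality equals its worst-rated design-requirement item.
RATING_ORDER = ["poor", "fair", "good", "excellent"]

def overall_quality(item_ratings):
    """Return the lowest rating among all design-requirement items."""
    return min(item_ratings, key=RATING_ORDER.index)

# Hypothetical item ratings: one poor item determines the final score.
print(overall_quality(["excellent", "good", "poor", "fair"]))  # -> poor
```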

2.6 Questionnaire Quality Assessment

2.6.1 Reliability

Reliability is defined as “the degree to which a measurement instrument is free from measurement error” [15]. Test–retest reliability outcomes were considered acceptable under the following conditions: (1) intraclass correlation coefficients and kappa values ≥ 0.70 [16]; or (2) Pearson, Spearman, or unknown correlations ≥ 0.80 [17]. Measurement error is defined as “the systematic and random error of a score that is not attributed to true changes in the construct” [15]. Measurement error outcomes were considered acceptable when the smallest detectable change (SDC) was smaller than the minimal important change (MIC) [16].
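For reference, one commonly used way to derive the SDC from a test–retest design (a general formulation, not specific to the included studies) is via the standard error of measurement (SEM):

$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \mathrm{ICC}}, \qquad \mathrm{SDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM},$$

where SD is the standard deviation of the scores and ICC the test–retest intraclass correlation coefficient; measurement error is then judged acceptable when this SDC is smaller than the MIC.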

The majority of the included studies reported multiple correlations per questionnaire for test–retest reliability, e.g., separate correlations for each questionnaire item. Therefore, an overall evidence rating was applied in order to obtain a final test–retest reliability rating, incorporating all correlations per questionnaire for each study. A positive (+) evidence rating was obtained if ≥ 80% of correlations were acceptable, a mixed (±) evidence rating was obtained when ≥ 50% and < 80% of correlations were acceptable, and a negative (–) evidence rating was obtained when < 50% of correlations were acceptable. For measurement error, no final evidence rating could be applied, as to our knowledge no information on the MIC is available for the included questionnaires. Furthermore, in the case of PoA, higher scores represent less measurement error.
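The overall evidence rating described above amounts to a simple threshold rule on the share of acceptable correlations; a minimal sketch (Python, with hypothetical ICC values) is given below.

```python
# Minimal sketch of the overall evidence rating per questionnaire and study:
# the share of acceptable correlations maps onto a positive (+), mixed (±),
# or negative (-) rating.
def evidence_rating(correlations, is_acceptable):
    share = sum(is_acceptable(r) for r in correlations) / len(correlations)
    if share >= 0.80:
        return "+"
    if share >= 0.50:
        return "±"
    return "-"

# Hypothetical ICC values judged against the 0.70 threshold: 4 of 5 acceptable.
iccs = [0.74, 0.81, 0.62, 0.90, 0.71]
print(evidence_rating(iccs, lambda icc: icc >= 0.70))  # -> +
```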

2.6.2 Validity

For validity, three different measurement properties can be distinguished, i.e., content validity, construct validity, and criterion validity [15]. Content validity is defined as “the degree to which the content of a measurement instrument is an adequate reflection of the construct to be measured” [15]. Construct validity is “the degree to which the scores of a measurement instrument are consistent with (a priori drafted) hypotheses” [15]. Hypotheses can concern internal relationships, i.e., structural validity, or relationships with other instruments. Criterion validity is defined as “the degree to which the scores of an instrument are an adequate reflection of a gold standard” [15].

Content validity could not be assessed, as for most studies a justification of choices, e.g., comprehensibility findings based on input from the target population or experts in the field, was missing. A summary of the studies examining content validity is provided in the Results section (Sect. 3). Since a priori formulated hypotheses for construct validity were often lacking, we formulated criteria regarding the relationships with other instruments, in line with previous reviews [13, 18]; see Table 1 for the criteria. The criteria were subdivided by level of evidence, with level 1 indicating strong evidence, level 2 indicating moderate evidence, and level 3 indicating weak evidence. Table 1 also includes criteria for criterion validity, e.g., when doubly labeled water was used as a comparison measure for questionnaires aiming to assess physical activity energy expenditure.

Table 1 Constructs of physical activity measured by the questionnaires evaluating construct and/or criterion validity, subdivided by level of evidence, and criteria for acceptable correlations

Most construct validity studies examined relationships with other instruments, reporting separate correlations for each questionnaire item. As with reliability, an overall evidence rating was applied incorporating all available correlations for each questionnaire per study (i.e., a positive, mixed, or negative evidence rating was obtained). Since no hypotheses were available for mean differences and limits of agreement, only a description of these results is included in the Results section (Sect. 3).

2.7 Inclusion of Results from the Previous Review

To draw definite conclusions regarding the best available questionnaires, the most promising questionnaires based on the previous review [8], i.e., published before May 2009, were also taken into account. As the previous review combined the methodological quality assessment and the questionnaire quality (i.e., results regarding measurement properties) in one rating, we reassessed the methodological and questionnaire quality of these previously published studies. We included only the studies that received a positive rating in the previous review for each measurement property. However, in the previous review, no final rating for measurement error was applied; therefore, all measurement error studies were reassessed and included in the current review. In addition, for construct validity, no final rating was applied in the previous review, as the majority of studies did not formulate a priori hypotheses. We chose to reassess the two studies showing the highest correlations between the questionnaire and an accelerometer, for each age category. The studies below this ‘top 2’ showed such low correlations that they would receive a negative evidence rating using our criteria. Furthermore, we assessed three other studies that formulated a priori hypotheses, as these studies may score higher regarding methodological quality. The reassessed studies are included in Tables 2, 3, 4 in the Results section.

Table 2 Construct validity of physical activity questionnaires for youth sorted by age category, methodological quality, and level of evidence and evidence rating
Table 3 Reliability of physical activity questionnaires for youth sorted by age category, methodological quality, and evidence rating
Table 4 Measurement error of physical activity questionnaires for youth sorted by age category and methodological quality

2.8 Best Evidence

We divided the included studies into three age categories, i.e., preschoolers, children, and adolescents, and drew conclusions on the best available questionnaire(s) for each age category. A questionnaire was considered of interest when at least a fair methodological quality and a positive evidence rating were achieved. Additionally, for construct validity, the level of evidence (see Table 1) was taken into account, so questionnaires assessed against a comparison measure providing a higher level of evidence were considered more valuable. Because no evidence ratings were available for measurement error, this measurement property was not taken into account when drawing conclusions about the best available questionnaire.
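Expressed as a simple rule, the selection described above (at least fair methodological quality plus a positive evidence rating) could look like the minimal sketch below (Python); the rating labels are those used in this review, and the sketch deliberately ignores the additional level-of-evidence weighting applied to construct validity.

```python
# Minimal sketch of the rule for flagging questionnaires "of interest":
# at least fair methodological quality and a positive (+) evidence rating.
QUALITY_ORDER = ["poor", "fair", "good", "excellent"]

def is_of_interest(methodological_quality, evidence_rating):
    """True when quality is at least 'fair' and the evidence rating is positive."""
    return (QUALITY_ORDER.index(methodological_quality) >= QUALITY_ORDER.index("fair")
            and evidence_rating == "+")

print(is_of_interest("good", "+"))   # -> True
print(is_of_interest("poor", "+"))   # -> False
print(is_of_interest("fair", "±"))   # -> False
```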

3 Results

Systematic literature searches using the PubMed, EMBASE, and SPORTDiscus databases yielded 15,220 articles after removal of duplicates. After title and abstract screening, 110 eligible articles remained. Another 21 articles were found through cross-reference searches. Therefore, 131 full-text articles were screened, which resulted in the inclusion of 71 articles examining 76 (versions of) questionnaires. After additionally including 16 articles from the previous review, this resulted in 87 articles examining 89 (versions of) questionnaires. See Fig. 1 for the full selection process. Within the 87 articles, 162 studies were conducted, with 103 assessing construct validity, 50 test–retest reliability, and nine measurement error. Four of the included questionnaires were assessed by two of the included studies, i.e., the 3-Day Physical Activity Recall (3DPARecall) [19, 20], the Activity Questionnaire for Adults and Adolescents (AQuAA) [21, 22], the Oxford Physical Activity Questionnaire (OPAQ) [23, 24], and a physical activity, sedentary behavior, and strength questionnaire [25, 26]. Furthermore, two of the questionnaires were assessed by three of the included studies, i.e., the Physical Activity Questionnaire for Older Children (PAQ-C) [27,28,29], and the Previous Day Physical Activity Recall (PDPAR) [30,31,32]. In addition, various modified versions of questionnaires were assessed by the included studies.

Fig. 1 Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram of study inclusion

3.1 Construct Validity

The construct validity results are summarized in Table 2. Of the 72 questionnaires that were assessed on construct validity, eight were from the previous review. Fifteen of the questionnaires were assessed by two studies, two were assessed by three studies, one by four, one by five, and one by six studies. Six questionnaires were assessed in preschoolers, 29 in children, and 38 in adolescents (one questionnaire was assessed in both children and adolescents). The methodological quality rating of the construct validity studies ranged from poor to good: 49 studies received a poor, 49 a fair, and five a good rating. The low methodological quality scores were predominantly due to comparison measures with unacceptable or unknown measurement properties and a lack of a priori formulated hypotheses. No definite conclusion could be drawn regarding the best available questionnaires for preschoolers, as studies on construct validity within this age category were of low methodological quality or received negative evidence ratings. For children, the best available questionnaire was the Godin Leisure-Time Exercise Questionnaire [63] (fair methodological quality and a positive level 2 evidence rating). Although the moderate (level 2) evidence hampers firm conclusions on its validity, this questionnaire is worth investigating further. We concluded that the most valid questionnaire in adolescents was the Greek version of the 3-Day Physical Activity Record (3DPARecord) [33] (fair methodological quality and a positive level 1 evidence rating). Note that the 3DPARecord uses a different format (i.e., different time segments and categories) than the frequently used 3DPARecall.

3.2 Content Validity

Six of the included questionnaires were qualitatively assessed on content validity, one of which was assessed by two studies [25, 26, 34,35,36,37]. Studies used cognitive interviews, semi-structured interviews, and focus groups with children and adolescents and/or experts (e.g., researchers in the field of sports medicine, pediatrics, and measurement) to assess the comprehensibility and relevance of the items and the comprehensiveness of the questionnaires. Due to a lack of detail on the methods used to develop or test these questionnaires, the methodological quality of these studies and the quality of the questionnaires could not be assessed. Ten of the included questionnaires were pilot-tested with children and/or parents on, for example, comprehensiveness and time to complete [33, 38,39,40,41,42,43,44,45]. However, again, the study quality could not be assessed due to the minimal amount of information provided. Lastly, 15 of the questionnaires were translated versions [33, 35, 39, 40, 43, 46,47,48,49,50,51,52,53]; the majority of these studies provided little information on the translation process. These studies did not assess cross-cultural validity, and thus no definite conclusion about the content validity of the translated questionnaires could be drawn.

3.3 Test–Retest Reliability

The test–retest reliability results are summarized in Table 3. Of the 46 questionnaires assessed on test–retest reliability, five were from the previous review. Four of the questionnaires were assessed by two studies. Five questionnaires were assessed in preschoolers, 16 in children, and 26 in adolescents (one questionnaire was assessed in both children and adolescents). The methodological quality of the studies was rated as follows: 13 scored poor, 26 fair, and 11 good. The majority of poor and fair scores were due to the lack of a description about how missing items were treated and inappropriate time intervals between test and retest. The most reliable questionnaire in preschoolers was the Energy Balance Related Behaviors (ERBs) self-administered primary caregivers questionnaire (PCQ) [46] (fair methodological quality and positive evidence rating). In children, the most reliable questionnaires were the Chinese version of the PAQ-C [43], and the Active Transportation to school and work in Norway (ATN) questionnaire [41] (both good methodological quality and positive evidence rating). The most reliable questionnaires in adolescents were a single-item activity measure [23], and the Web-based and paper-based PAQ-C [28] (both good methodological quality and positive evidence rating).

3.4 Measurement Error

Table 4 summarizes the measurement error outcomes. Of the nine questionnaires assessed on measurement error, two were from the previous review. One questionnaire was assessed in preschoolers, three in children, and five in adolescents. Four of the studies received a good methodological quality rating, and five received a fair one. Fair scores were predominantly due to the lack of a description about how missing items were treated.

4 Discussion

This review summarizes studies that assessed the measurement properties of physical activity questionnaires for children and adolescents under the age of 18 years. Questionnaires varied in the (sub)constructs measured, recall periods, number of questions, and format, as well as in the measurement properties that were assessed, e.g., construct validity, test–retest reliability, or measurement error. Unfortunately, most studies had low methodological quality scores and low evidence ratings, especially for construct validity. Additionally, no questionnaire was identified with both high methodological quality and positive evidence ratings for reliability and validity. Furthermore, for the majority of questionnaires there was a lack of data on both reliability and validity. Consequently, no definite conclusion regarding the most promising questionnaire can be drawn.

4.1 Construct Validity

For adolescents, one valid questionnaire was found, i.e., the Greek version of the 3DPARecord [33]. The 3DPARecord is a questionnaire using a segmented day structure that divides each of the previous 3 days (including 1 weekend day) into timeframes of 15 min each, with the adolescents reporting their activity for each timeframe using nine categories ranging from 1 (sleep) to 9 (vigorous physical activity and sport) [33].
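To make the segmented-day format concrete, the sketch below (Python) shows how such a recall could, in principle, be summarized into daily MVPA minutes; the mapping of category codes to MVPA and the example day are purely hypothetical and do not reproduce the actual scoring rules of the 3DPARecord described in [33].

```python
# Minimal sketch of summarizing a segmented-day recall: each day is a list of
# 15-min segments coded 1 (sleep) to 9 (vigorous physical activity and sport).
# Treating codes 8 and 9 as MVPA is a hypothetical choice for illustration only.
MVPA_CODES = {8, 9}
SEGMENT_LENGTH_MIN = 15

def mvpa_minutes_per_day(day_segments):
    """Sum the duration of all segments whose category code counts as MVPA."""
    return sum(SEGMENT_LENGTH_MIN for code in day_segments if code in MVPA_CODES)

# One hypothetical (partial) day of segment codes.
day = [1, 1, 3, 4, 8, 9, 9, 5, 2]
print(mvpa_minutes_per_day(day))  # -> 45 (three MVPA-coded segments)
```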

Due to the predominantly low methodological study quality and negative evidence ratings for study results in children and preschoolers, no valid questionnaires were identified. The low methodological quality of the studies was predominantly due to a lack of a priori formulated hypotheses and the use of comparison measures with unknown or unacceptable measurement properties. Moreover, in some studies comparisons between non-corresponding constructs were made, e.g., moderate to vigorous physical activity (MVPA) measured by a questionnaire compared with total accelerometer counts.

4.2 Test–Retest Reliability and Measurement Error

For preschoolers, one reliable questionnaire was identified: the ERBs self-administered PCQ [46]; two reliable questionnaires were identified for children: the Chinese version of the PAQ-C [43] and the ATN questionnaire [41]; and two for adolescents: a single-item activity measure [23] and the web- and paper-based PAQ-C [28].

Many questionnaires received a positive evidence rating but due to the low methodological quality of the studies no definite conclusions regarding their reliability could be drawn. The low methodological quality was mainly due to inappropriate time intervals between test and retest, and the lack of a description about how missing items were handled. Unfortunately, no final evidence rating for measurement error could be computed as none of the studies provided information on the MIC.

4.3 Strengths and Limitations

A strength of this review is the separate assessment of the questionnaire quality (i.e., results for measurement properties) and the methodological quality of the study in which the questionnaire was assessed. This provides transparency in the conclusion regarding the best available questionnaires. Furthermore, data extraction and assessment of methodological quality were carried out by at least two independent researchers, minimizing the chance of bias. In addition, cross-reference searches were carried out, thereby increasing the likelihood of finding all relevant studies. However, we only included studies available in English, possibly missing relevant studies published in other languages.

4.4 Recommendations for Future Research

Due to the methodological limitations of existing studies, we cannot draw definite conclusions on the measurement properties of physical activity questionnaires. This hampers the identification of the most suitable questionnaires for assessing physical activity in children. To improve future research we recommend the following:

  • Using standardized tools for the evaluation of measurement properties such as COSMIN, to improve the quality of studies examining measurement properties [11, 54];

  • Using appropriate translation methods [17];

  • Using, in a validation study, the mode of administration that is intended for use in the field;

  • Defining the context of use and the measurement model of the questionnaire to determine which measurement properties are relevant to examine;

  • Conducting more studies assessing content validity to ensure questionnaires are comprehensive and an adequate reflection of the construct to be measured [13, 55];

  • For construct validity, choosing a comparison measure that measures a similar construct and formulating hypotheses a priori;

  • For reliability studies, ensuring that test and retest concern the same day/week when the questionnaire recalls a previous day/week;

  • Conducting more research on the responsiveness of valid and reliable questionnaires;

  • Building on or improving the most promising existing questionnaires rather than developing new questionnaires;

  • Providing open access to the examined questionnaire; and

  • Journal editors requesting that reviewers and authors use a standardized tool such as COSMIN for studies on measurement properties.

5 Conclusions

Unfortunately, conclusive evidence for both validity and reliability was not found for any of the identified physical activity questionnaires. The lack of high-quality studies examining both the reliability and the validity of a questionnaire hampered the ability to draw definite conclusions about the best available physical activity questionnaire for children and adolescents. Thus, high-quality methodological studies examining all relevant measurement properties are highly warranted. We strongly recommend that researchers adopt standardized tools, e.g., the COSMIN methodology [11, 56, 57], for the design and reporting of future studies. Researchers currently using physical activity questionnaires should keep in mind that their results may not adequately reflect children’s and adolescents’ physical activity levels, as evidence of adequate validity and/or reliability is lacking for most questionnaires.