Literature Search
The literature search resulted in 1,719 hits. Of these, 27 articles were selected based on titles and abstracts. After reading the full texts, ten articles were excluded because they did not report measurement properties (n = 5) [29,30,31,32,33] or because they used a diary/record (n = 3) [34,35,36] or an interview (n = 2) [37, 38]. Finally, 17 articles [39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55] on 11 different PA questionnaires (17 versions) [39, 44, 56,57,58,59,60,61,62,63] were included (Fig. 1). Overall, these 17 articles reported 18 studies of measurement properties. To avoid any misconceptions, it should be noted that the studies describing the development of the short and long forms of the International Physical Activity Questionnaire (IPAQ) share the same reference [59]. Results are presented separately for questionnaires developed for the pregnant and non-pregnant populations, solely to improve readability.
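The selection flow described above can be checked with simple arithmetic. The following is a minimal sketch that uses only the counts reported in the text; the variable names are illustrative and not part of the review.

```python
# Counts taken from the text; names are illustrative only.
hits = 1719                    # records returned by the literature search
selected_on_abstract = 27      # articles selected based on titles and abstracts

# Reasons for exclusion after full-text reading, with reported counts.
excluded = {
    "no measurement properties": 5,
    "diary/record": 3,
    "interview": 2,
}

n_excluded = sum(excluded.values())
included = selected_on_abstract - n_excluded

print(n_excluded)  # 10
print(included)    # 17
```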
Table 2 shows a summary of all included articles and questionnaires in combination with evaluated measurement properties and study populations. Construct validity was assessed for all questionnaires, whereas reliability (parameters of reliability and measurement error) was assessed for six questionnaires (11 versions) and responsiveness for two questionnaires. In most studies, an accelerometer was used as a comparison measure. Eight studies [42,43,44,45,46, 49, 51, 55] assessed the measurement properties of the Pregnancy Physical Activity Questionnaire (PPAQ) [44] or adaptations of this questionnaire (e.g., Japanese version). Another study [48] evaluated the long form of the IPAQ, whereas two studies (of reliability and construct validity), reported in one article [52], evaluated the short form of the IPAQ (IPAQ-SF). One study [39] used a strongly modified version of the IPAQ measuring leisure time (LT) PA (LTPA) in pregnancy. One article [40] reported one study evaluating two questionnaires, namely the Australian Women’s Activity Study (AWAS) [60] and the Recent Physical Activity Questionnaire (RPAQ) [57].
Table 2 Explanation of acronyms or abbreviated names of questionnaires, studies on measurement properties and sample characteristics
Description of Questionnaires
A detailed description of the questionnaires is shown in Table 3. Of the 11 questionnaires, four were developed to assess PA in pregnant women [39, 44, 62, 63], whereas five were developed for adults [56, 57, 59, 61], one for adults and adolescents [58], and one for women with young children [60].
Table 3 Description of PA questionnaires
Of the seven questionnaires that were developed for the non-pregnant population, six (Activity Questionnaire for Adolescents and Adults [AQuAA], AWAS, Global Physical Activity Questionnaire [GPAQ], IPAQ, IPAQ-SF, RPAQ) aim to measure the construct PA and one (Leisure-Time Exercise Questionnaire [LTEQ]) measures LT exercise. When assessing (total) PA, the AQuAA, AWAS, GPAQ, IPAQ, and RPAQ cover all relevant settings of PA (home, recreation, sports, transport, work). The GPAQ assesses sport-related PA within discretionary time (leisure, recreation, sports). Likewise, the RPAQ assesses sport-related PA such as competitive running and swimming in its section on recreation. The AWAS assesses planned activities (including sports, leisure, recreation) and was developed to measure PA in women with young children, and therefore focuses particularly on childcare activities and domestic responsibilities. The IPAQ-SF aims to cover all settings of PA without discriminating between them. Most of the questionnaires use a typical week or the last week as a recall period and the number of questions varies from seven (IPAQ-SF) to 68 (AWAS). Duration, frequency, and intensity of PA are obtained by all questionnaires except LTEQ, which only collects frequency and intensity. Usually, both a total PA score and separate scores for time spent in different intensity levels (e.g., light PA, VPA) as well as sedentary behavior (SB) can be calculated using minutes per day/week, MET min per week or frequency per week as units of measurement. In addition, GPAQ, IPAQ, and RPAQ provide separate PA scores for different settings.
Of the four questionnaires developed for the pregnant population, PA is measured with reference to the specific trimester (Physical Activity and Pregnancy Questionnaire [PAPQ], PPAQ), the last 2 weeks (Leisure-Time Physical Activity Questionnaire [LTPAQ]) [39] or since becoming pregnant (Questionnaire of recreational exercise from Norwegian Mother and Child Cohort Study [Q1 of MoBa]) [63]. PAPQ and PPAQ aim to measure the construct (total) PA, whereas LTPAQ and Q1 of MoBa aim to measure LTPA or recreational exercise during pregnancy. The LTPAQ was based on the IPAQ but was strongly modified to provide a better discrimination between the structured (LT excluding household) and unstructured (household) features of PA. Parameters of duration, frequency, and intensity of PA are assessed by all questionnaires except Q1 of MoBa. Scores for total PA, time spent in light PA, moderate PA, VPA, and SB can be calculated for the PAPQ, PPAQ, and LTPAQ. For Q1 of MoBa, only a total PA score can be calculated. All four questionnaires use minutes per week or MET min/week to calculate PA scores.
Finally, all questionnaires that assigned MET intensities for activities use compendium-based information about intensities for different activities [64]. These MET intensities are based on the general population, including men and non-pregnant women. In contrast, the PPAQ uses pregnancy-specific MET intensities whenever possible, such as for walking and light-to-moderate intense household activities [44].
Assessment of Measurement Properties
Content Validity
A comprehensive evaluation of the content validity of PA questionnaires during pregnancy was not part of this review. Consequently, no included study assessed content validity with a formal methodological approach, but some provided information on content validity. During the development of the PPAQ, one study [44] used 24-h recalls to select both prevalent and discriminatory activities of pregnant women. The findings of the study showed that watching television, standing or slowly walking at work while carrying light/moderate loads, and childcare were the most relevant activities. Another study [54] discussed the content validity of the GPAQ theoretically in the context of previous research and expert opinions. The authors concluded that the GPAQ includes important settings (e.g., work, transport, leisure) and scores (frequency, duration, intensity) of PA but that including pregnancy-specific activities (and settings) such as caregiving might improve content validity. Furthermore, one study [39] of the LTPAQ strongly modified the IPAQ to provide a better discrimination between the structured (LT excluding household) and unstructured (household) features of PA. The authors excluded occupational PA and used the degree of breathlessness (none, some, strong) instead of light, moderate, and vigorous to describe the intensity of activities, which may be easier for some women to understand. Finally, studies of adaptations of the PPAQ [43, 45, 49, 51, 55] included expert opinions and pilot studies to assess content validity and, consequently, items were modified and/or deleted during their cross-cultural validation process.
According to our criterion (i) (see Sect. 2.5.1), of those questionnaires that aim to measure total PA, AQuAA, AWAS, GPAQ, IPAQ, IPAQ-SF, PAPQ, and PPAQ cover all relevant settings of PA. The RPAQ does not collect information on household-related activities [57] since the authors showed in a previous study [65] that these activities were inversely correlated with objectively measured PA. Therefore, they only included a few activities such as stair-climbing at home, mowing the lawn, watering the lawn or garden, or home maintenance. The IPAQ-SF aims to cover all settings of PA, but domain-specific scores cannot be obtained. The LTEQ, LTPAQ, and Q1 of MoBa were developed to collect specific information about LT/recreational exercise and LTPA rather than total PA. According to criterion (ii) (see Sect. 2.5.1), all included questionnaires assess frequency and duration of PA except LTEQ and Q1 of MoBa and no questionnaire uses a recall period of less than 1 week. In sum, the AQuAA, AWAS, GPAQ, IPAQ, IPAQ-SF, LTPAQ, PAPQ, and PPAQ provided sufficient content validity for the assessment of PA during pregnancy, whereas LTEQ, Q1 of MoBa, and RPAQ did not.
Reliability
The results for reliability (parameters of reliability and measurement error) of ten studies of six questionnaires (11 versions) are summarized in Table 4. Of the questionnaires developed for the non-pregnant population, the IPAQ-SF [52] showed sufficient reliability for all estimates of PA, the LTEQ [53] for strenuous LT exercise but not for total, mild, and moderate LT exercise, and the RPAQ [40] showed sufficient reliability for moderate PA but insufficient reliability for all other estimates of PA. The AWAS [40] showed insufficient reliability (ICC < 0.70).
Table 4 Parameters of reliability and measurement error of PA questionnaires during pregnancy
Of the questionnaires developed for the pregnant population, parameters of reliability and measurement error were only assessed for (versions of) the PPAQ and LTPAQ. In sum, studies of the English [44], Turkish [45], and Vietnamese versions [51] of the PPAQ showed sufficient reliability. The Chinese version [55] showed sufficient reliability for all PA scores except moderate PA, VPA, and sports/exercise. The French version of the PPAQ [43] showed sufficient reliability for all scores except for transportational PA and, likewise, the Japanese version [49] for all scores except for transportational PA, sports/exercise, and occupational PA (1-week interval only). Although three studies [39, 49, 51] assessed measurement error, only one study reported LOA or CV for repeated measurements. In particular, the results for the LTPAQ [39] were insufficient because of large LOA (MIC for frequency/duration < LOA/SDC) and CV. These values indicate large measurement errors and hamper reliable detection of the MIC of PA (e.g., two sessions or 30 min of MVPA per week) [17].
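The comparison of LOA with the MIC can be illustrated with a short sketch. It computes Bland-Altman-type 95% LOA for test-retest differences and checks whether they fall within an MIC of 30 min of MVPA per week, the example value given in the text. The test-retest data are hypothetical, the function names are illustrative, and the decision rule (largest absolute limit compared with the MIC) is an assumption made for illustration rather than the rule applied in the included studies.

```python
from statistics import mean, stdev

def limits_of_agreement(test, retest):
    """Return the lower and upper 95% limits of agreement (bias +/- 1.96 SD)."""
    diffs = [second - first for first, second in zip(test, retest)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias - half_width, bias + half_width

def error_smaller_than_mic(test, retest, mic):
    """Assumed rule: measurement error is acceptable if both limits lie within +/- MIC."""
    lower, upper = limits_of_agreement(test, retest)
    return max(abs(lower), abs(upper)) < mic

# Hypothetical MVPA scores (min/week) from two administrations of a questionnaire.
test_scores = [120, 90, 200, 60, 150]
retest_scores = [150, 60, 170, 110, 140]

lower, upper = limits_of_agreement(test_scores, retest_scores)
print(round(lower, 1), round(upper, 1))
print(error_smaller_than_mic(test_scores, retest_scores, mic=30))
```

With these hypothetical values the LOA are far wider than 30 min per week, so a change of that size could not be distinguished from measurement error.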
Construct and Criterion Validity
The results for construct validity are summarized in Table 5. Of the 11 different questionnaires, construct validity was mostly assessed by validation against accelerometers and less often against pedometers, logbooks, or other PA questionnaires.
Table 5 Construct validity and responsiveness of PA questionnaires during pregnancy
Of the seven questionnaires developed for the non-pregnant population, the AQuAA [50], AWAS [40], GPAQ [54], IPAQ [48], IPAQ-SF [52], and LTEQ [53] showed insufficient construct validity because of low coefficients or large disagreements (e.g., wide LOA). The RPAQ [40] showed a sufficient correlation with PA estimates from the accelerometer for total active time (r ≥ 0.50) but not for total physical activity energy expenditure (PAEE) and other estimates of PA.
Of the four questionnaires developed for the pregnant population, the LTPAQ [39] showed insufficient construct validity. The ratings for the PAPQ [47] were insufficient for light and moderate PA but sufficient for VPA. However, the LOA indicated large disagreement between PAPQ and accelerometry in assessing VPA. The results of studies of the construct validity of (versions of) the PPAQ were predominantly insufficient, such as for the Vietnamese [51], Japanese [49], English [44, 46], Chinese [55], and bilingual [46] versions of the questionnaire. Likewise, the second study [42] of the English version revealed insufficient construct validity for all scores except for LT-MVPA. The Turkish version of the PPAQ [45] showed sufficient validity for the assessment of total PA due to a high correlation with the pedometer but insufficient ratings for all other estimates. The French version of the PPAQ [43] received sufficient ratings for total, light, and moderate PA, household/caregiving PA, and occupational PA but insufficient ratings for sports/exercise, vigorous, and transportational PA. Finally, Q1 of MoBa [41] showed insufficient construct validity. There was a low correlation (r < 0.50) between the sum of weekly exercise estimated from the questionnaire and VPA estimated from the accelerometer.
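The rating rule referred to above (r ≥ 0.50 rated as sufficient, r < 0.50 as insufficient) can be sketched as follows. The paired values are hypothetical, the function names are illustrative, and a Pearson coefficient is used only for simplicity; the included studies may have reported other coefficients.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient for two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def rate_construct_validity(r, threshold=0.50):
    """Rate a correlation against the r >= 0.50 threshold mentioned in the text."""
    return "sufficient" if r >= threshold else "insufficient"

# Hypothetical questionnaire vs. accelerometer estimates (min/week).
questionnaire = [300, 120, 450, 200, 90, 260]
accelerometer = [150, 140, 180, 90, 130, 110]

r = pearson_r(questionnaire, accelerometer)
print(round(r, 2), rate_construct_validity(r))
```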
Responsiveness
Only two studies examined responsiveness for two questionnaires (see Table 5). The AQuAA [50] showed insufficient responsiveness. Similarly, the GPAQ [54] showed insufficient responsiveness because of large disagreements (large LOA) between the questionnaire and accelerometer. Moreover, the GPAQ showed both systematic (difference in intercepts) and proportional differences (difference in slopes) regarding the change in MVPA between 14–18 and 29–33 weeks of gestation as indicated by Passing-Bablok regression [54].
Quality of Individual Studies
Regarding the assessment of reliability of each PA score, nine studies [39, 40, 43,44,45, 49, 51, 52, 55] of AWAS, IPAQ-SF, LTPAQ, PPAQ, and RPAQ were at the highest level of quality (level 1) and one study [53] of the LTEQ was at level 3 because of the use of Pearson correlations and an inadequate time interval between test and retest. Regarding construct validity, six studies [40, 41, 47, 50, 52, 54] of AQuAA, AWAS, GPAQ, IPAQ-SF, PAPQ, and Q1 of MoBa were at the highest level of quality (level 1), four studies [40, 43, 44, 55] of PPAQ and RPAQ at levels 1 and 2, one study [42] of PPAQ at levels 1 and 3, and six studies [39, 45, 46, 49, 51, 53] of LTEQ, LTPAQ, and PPAQ at level 3 (see Table 5). The quality of one study of the IPAQ was at level 1, level 2, or level 3 depending on the evaluated PA score [48]. Different levels of quality were assigned due to comparisons with either objective (e.g., accelerometer, pedometer) or subjective (e.g., logbook, questionnaire) measures of PA or comparisons between different intensity levels. For example, a lower level of quality was assigned if light PA measured by the questionnaire was compared with MVPA measured by the accelerometer (e.g., Japanese version of the PPAQ) [49] or if PA measured by the questionnaire was compared with pedometer-measured daily steps (e.g., LTEQ) [53]. Furthermore, the quality for the assessment of total PA was often at level 2 because total PAEE estimated from the questionnaires was compared against accelerometer-estimated total counts. Responsiveness was evaluated in two studies [50, 54] for two questionnaires (AQuAA, GPAQ). The quality of these studies was rated as level 1.
Finally, almost none of the studies formulated a priori hypotheses about expected results for construct validity or responsiveness. Only two studies [50, 52] of the AQuAA and IPAQ-SF considered a minimum correlation of r = 0.5 as an adequate agreement between PA questionnaire and accelerometer.
Quality of Evidence
Table 6 summarizes the overall results (i.e., sufficient/insufficient measurement properties) and quality of evidence (GRADE) for three PA scores: total PA, MVPA, and VPA (per questionnaire and measurement property). None of the questionnaires provided evidence for all the relevant measurement properties (i.e., reliability [parameters of reliability or measurement error], construct validity, responsiveness). Only for the AWAS, IPAQ-SF, LTEQ, LTPAQ, PPAQ (i.e., Chinese, English, French, Japanese, Turkish, Vietnamese versions), and RPAQ were both reliability and construct validity assessed. Because there was usually only one study per questionnaire version and PA score available (except PPAQ), inconsistency could not be evaluated for these studies. With reference to the eligibility criteria and the checklist for methodological quality, we identified no serious indirectness and therefore did not downgrade the quality of evidence for any of the PA scores due to this factor.
Table 6 GRADE evidence profile: measurement properties of PA questionnaires for the assessment of total PA, MVPA and VPA during pregnancy
Overall and irrespective of the reported results (i.e., sufficient/insufficient measurement properties), the quality of the body of evidence was limited and ranged from very low to moderate. There was no high-quality evidence indicating that any of the included questionnaires had sufficient measurement properties in assessing total PA, MVPA, or VPA. Only the Turkish and French versions of the PPAQ showed both sufficient reliability and construct validity when assessing total PA (but not MVPA and VPA), but these results were based on low-to-moderate quality evidence.
Although different language versions of questionnaires should initially be treated separately [26], one may consider pooling the results (i.e., body of evidence) of the different versions of the PPAQ. When doing so, there was high-quality evidence (no serious risk of bias, no serious imprecision, no serious inconsistency, no serious indirectness) that the PPAQ had sufficient reliability in assessing total PA and VPA. We did not consider downgrading the quality of evidence for VPA as most of the results were sufficient (four of five studies). The exception was the Chinese version, whose insufficient result may have occurred because most women did not engage in these activities, as suggested by the authors [55].
The results for construct validity of the PPAQ were inconsistent for total PA (i.e., two studies showed sufficient and five studies insufficient results) and consistently insufficient for VPA (see Table 6). When pooling these results, the PPAQ showed insufficient validity in assessing total PA, which was based on low-quality evidence (serious risk of bias, serious inconsistency, no serious imprecision, no serious indirectness). Similarly, there was moderate-quality evidence that the PPAQ has insufficient validity in assessing VPA (serious risk of bias, no serious inconsistency, no serious imprecision, no serious indirectness). We could not pool the results for MVPA and other measurement properties such as measurement error and responsiveness of the PPAQ due to a lack of multiple studies.
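The relationship between the number of serious concerns and the resulting quality of evidence in the pooled PPAQ results can be sketched as a simple downgrading rule. This is a minimal illustration that assumes evidence starts at high quality and is downgraded one level per factor rated "serious"; it does not model "very serious" ratings or any other aspect of the GRADE approach, and the function name is hypothetical.

```python
# Quality levels ordered from best to worst.
LEVELS = ["high", "moderate", "low", "very low"]

def grade_quality(serious_factors):
    """Downgrade from 'high' by one level per serious factor (assumed rule)."""
    steps = min(len(serious_factors), len(LEVELS) - 1)
    return LEVELS[steps]

# Ratings as reported in the text for the pooled PPAQ results.
reliability_total_pa = grade_quality([])
validity_total_pa = grade_quality(["risk of bias", "inconsistency"])
validity_vpa = grade_quality(["risk of bias"])

print(reliability_total_pa)  # high
print(validity_total_pa)     # low
print(validity_vpa)          # moderate
```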